Abstract

We study the problem of recursively recovering a time sequence of sparse vectors, $S_t$, from measurements $M_t := S_t + L_t$ that are corrupted by structured noise $L_t$ which is dense and can have large magnitude. The structure that we require is that $L_t$ should lie in a low dimensional subspace that is either fixed or changes "slowly enough", and that the eigenvalues of its covariance matrix are "clustered". We do not assume any model on the sequence of sparse vectors. Their support sets and their nonzero element values may be either independent or correlated over time (in many applications they are correlated). The only requirement is that there be some support change every so often. We introduce a novel solution approach called Recursive Projected Compressive Sensing with cluster-PCA (ReProCS-cPCA) that addresses some of the limitations of earlier work. Under mild assumptions, we show that, with high probability, ReProCS-cPCA can exactly recover the support set of $S_t$ at all times, and the reconstruction errors of both $S_t$ and $L_t$ are upper bounded by a time-invariant and small value.

I. Introduction

In this work, we study the problem of recursively recovering a time sequence of sparse vectors, $S_t$, from measurements $M_t := S_t + L_t$ that are corrupted by structured noise $L_t$ which is dense and can have large magnitude. The structure that we require is that $L_t$ should lie in a low dimensional subspace that is either fixed or changes "slowly enough" as discussed in Sec II-B, and that the eigenvalues of its covariance matrix are "clustered" as explained in Sec II-D. As a by-product, at certain times, the basis vectors for the subspace in which the most recent several $L_t$'s lie are also recovered. Thus, at these times, we also solve the recursive robust principal components' analysis (PCA) problem. For the recursive robust PCA problem, $L_t$ is the signal of interest while $S_t$ can be interpreted as the outlier (large but sparse noise).

A key application where the above problem occurs is in video analysis where the goal is to separate a slowly changing background from moving foreground objects [1], [2]. If one stacks each image frame as a column vector, the background is well modeled as lying in a low dimensional subspace that may gradually change over time, while the moving foreground objects constitute the sparse vectors, which change in a correlated fashion over time. Another key application is online detection of brain activation patterns from functional MRI (fMRI) sequences. In this case, the "active" region of the brain is the correlated sparse vector.

A. Related Work

Many of the older works on sparse recovery with structured noise study the case of sparse recovery from large but sparse noise (outliers). However, here we are interested in sparse recovery in large but low dimensional noise. On the other hand, most older works on robust PCA cannot recover the outlier ($S_t$) when its nonzero entries have magnitude much smaller than that of the low dimensional part ($L_t$). The main goal of this work is to study sparse recovery and hence we do not discuss these older works here. Some recent works on robust PCA assume that an entire measurement vector $M_t$ is either an inlier ($S_t$ is a zero vector) or an outlier (all entries of $S_t$ can be nonzero), and that a certain number of $M_t$'s are inliers. These works also cannot be used when all $S_t$'s are nonzero but sparse.

C. Qiu and N. Vaswani are with the ECE dept at Iowa State University. Email: {chenlu, namrata}@iastate.edu. This work was supported by NSF grant CCF-1117125. A shorter version of this work is submitted to ISIT 2013.

In a series of recent works [2], [10], a new and elegant solution, which is referred to as Principal Components' Pursuit (PCP) in [2], has been proposed. It redefines batch robust PCA as a problem of separating a low rank matrix, $\mathcal{L}_t := [L_1, \dots, L_t]$, from a sparse matrix, $\mathcal{S}_t := [S_1, \dots, S_t]$, using the measurement matrix, $\mathcal{M}_t := [M_1, \dots, M_t] = \mathcal{L}_t + \mathcal{S}_t$. Thus these works can be interpreted as batch solutions to sparse recovery in large but low dimensional noise. Other recent works that also study batch algorithms for recovering a sparse $\mathcal{S}_t$ and a low-rank $\mathcal{L}_t$ from $\mathcal{M}_t := \mathcal{L}_t + \mathcal{S}_t$ or from undersampled measurements

include flU, El, El, E), US), GH, El, El, El, El. 

It was shown in [2] that, with high probability (w.h.p.), one can recover $\mathcal{L}_t$ and $\mathcal{S}_t$ exactly by solving

$$\min_{\mathcal{L},\mathcal{S}} \|\mathcal{L}\|_* + \lambda \|\mathcal{S}\|_{1,\text{vec}} \quad \text{subject to} \quad \mathcal{L} + \mathcal{S} = \mathcal{M}_t \qquad (1)$$

provided that (a) $\mathcal{L}_t$ is dense (its left and right singular vectors satisfy certain conditions); (b) any element of the matrix $\mathcal{S}_t$ is nonzero w.p. $\varrho$, and zero w.p. $1 - \varrho$, independent of all others (in particular, this means that the support sets of the different $S_t$'s are independent over time); and (c) the rank of $\mathcal{L}_t$ and the support size of $\mathcal{S}_t$ are small enough. Here $\|B\|_*$ is the nuclear norm of $B$ (sum of singular values of $B$) while $\|B\|_{1,\text{vec}}$ is the $\ell_1$ norm of $B$ seen as a long vector. In most applications, it is fair to assume that the low dimensional part, $L_t$ (background in case of video), is dense. However, the assumption that the support of the sparse part (foreground in case of video) is independent over time is often not valid. Foreground objects typically move in a correlated fashion, and may even not move for a few frames. This results in $\mathcal{S}_t$ being sparse and low rank.

The question then is, what can we do if $\mathcal{L}_t$ is low rank and dense, but $\mathcal{S}_t$ is sparse and may also be low rank? In this case, without any extra information, in general, it is not possible to separate $\mathcal{S}_t$ and $\mathcal{L}_t$. In [21], we introduced the Recursive Projected Compressive Sensing (ReProCS) algorithm that provided one possible solution to this problem by using the extra piece of information that an initial short sequence of $L_t$'s, or $L_t$'s in small noise, is available (which can be used to get an accurate estimate of the subspace in which the initial $L_t$'s lie) and assuming slow subspace change (as explained in Sec II-B). The key idea of ReProCS is as follows. At time $t$, assume that an $n \times r$ matrix with orthonormal columns, $\hat{P}_{(t-1)}$, is available with $\text{span}(\hat{P}_{(t-1)}) \approx \text{span}(\mathcal{L}_{t-1})$. We project $M_t$ perpendicular to $\text{span}(\hat{P}_{(t-1)})$. Because of slow subspace change, this cancels out most of the contribution of $L_t$. Recovering $S_t$ from the projected measurements then becomes a classical sparse recovery / compressive sensing (CS) problem in small noise [22]. Under a denseness assumption on $\text{span}(\mathcal{L}_{t-1})$, one can show that $S_t$ can be accurately recovered via $\ell_1$ minimization. Thus, $L_t$ can also be recovered accurately as $\hat{L}_t = M_t - \hat{S}_t$. We use the estimates $\hat{L}_t$ in a projection-PCA based subspace estimation algorithm to update $\hat{P}_{(t)}$.
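The projection step described above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: the dimensions, the magnitudes, and the way the subspace estimate `P_hat` is perturbed are all illustrative assumptions. It only shows that projecting perpendicular to an accurate subspace estimate nearly cancels a large $L_t$, leaving a sparse recovery problem in small noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, s = 200, 5, 8  # ambient dimension, subspace rank, sparsity (illustrative values)

# True basis P and a slightly perturbed estimate P_hat (both have orthonormal columns).
P, _ = np.linalg.qr(rng.standard_normal((n, r)))
P_hat, _ = np.linalg.qr(P + 1e-3 * rng.standard_normal((n, r)))

# Dense, large-magnitude low dimensional part and a sparse part.
L_t = P @ (50.0 * rng.standard_normal(r))
S_t = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
S_t[support] = 5.0 * rng.choice([-1.0, 1.0], size=s)
M_t = S_t + L_t

# Project perpendicular to span(P_hat): y_t = Phi S_t + beta_t.
Phi = np.eye(n) - P_hat @ P_hat.T
y_t = Phi @ M_t
beta_t = Phi @ L_t  # residual of L_t, the "small noise" seen by the CS step

noise_ratio = np.linalg.norm(beta_t) / np.linalg.norm(L_t)
print(f"||Phi L_t||_2 / ||L_t||_2 = {noise_ratio:.2e}")
```

In this sketch the sparse vector would then be estimated from `y_t` with any noisy $\ell_1$ solver; that step is omitted here.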

ReProCS is designed under the assumption that the subspace in which the most recent several $L_t$'s lie can only grow over time. It assumes a model in which at every subspace change time, $t_j$, some new directions get added to this subspace. After every subspace change, it uses projection-PCA to estimate the newly added subspace. As a result the rank of $\hat{P}_{(t)}$ keeps increasing with every subspace change. Therefore, the number of effective measurements available for the CS step, $(n - \text{rank}(\hat{P}_{(t-1)}))$, keeps reducing. To keep this number large enough at all times, ReProCS needs to assume a bound on the total number of subspace changes, $J$.

B. Our Contributions and More Related Work

In practice, the dimension of the subspace in which the most recent several $L_t$'s lie typically remains roughly constant. A simple way to model this is to assume that at every change time, $t_j$, some new directions can get added and some existing directions can get deleted from this subspace, and to assume an upper bound on the difference between the total number of added and deleted directions (the earlier model in [21] is a special case of this). ReProCS still applies for this more general model as discussed in the extensions section of [21]. However, because it never deletes directions, the rank of $\hat{P}_{(t)}$ still keeps increasing with every subspace change time and so it still requires a bound on $J$.

In this work, we address the above limitation by introducing a novel approach called cluster-PCA that re-estimates the current subspace after the newly added directions have been accurately estimated. This re-estimation step ensures that the deleted directions have been "removed" from the new $\hat{P}_{(t)}$. We refer to the resulting algorithm as ReProCS-cPCA. The design and analysis of cluster-PCA and ReProCS-cPCA is the focus of the current paper. We will see that ReProCS-cPCA does not need a bound on $J$ as long as the delay between subspace change times increases in proportion to $\log J$. An extra assumption that is needed though is that the eigenvalues of the covariance matrix of $L_t$ are sufficiently clustered at certain times as explained in Sec II-D. As discussed in Sec IV-B, this is a practically valid assumption.

Under the clustering assumption and some other mild assumptions, we show that, w.h.p., at all times, ReProCS-cPCA can exactly recover the support of $S_t$, and the reconstruction errors of both $S_t$ and $L_t$ are upper bounded by a time-invariant and small value. Moreover, we show that the subspace recovery error decays roughly exponentially with every projection-PCA step. The proof techniques developed in this work are very different from those used to obtain performance guarantees in recent batch robust PCA works such as [2], [10], and the other batch works cited above. As explained earlier, the robust PCA works that treat an entire measurement vector as either an inlier or an outlier also study a different problem. Our proof utilizes sparse recovery results [22]; results from matrix perturbation theory (sin $\theta$ theorem [24] and Weyl's theorem [25]); and the matrix Hoeffding inequality [26].

Our result for ReProCS-cPCA (and also that for ReProCS from [21]) does not assume any model on the sparse vectors, $S_t$'s. In particular, it allows the support sets of the $S_t$'s to be either independent, e.g. generated via the model of [2] (resulting in $\mathcal{S}_t$ being full rank w.h.p.), or correlated over time (which can result in $\mathcal{S}_t$ being low rank). As explained in Sec IV-B, the only thing that is required is that there be some support changes every so often. We should point out that some of the other works that study the batch problem, e.g. [16], also allow $\mathcal{S}_t$ to be low rank.

A key difference of our work compared with most existing work analyzing finite sample PCA, e.g. [27] and references therein, is that in these works, the noise/error in the observed data is independent of the true (noise-free) data. However, in our case, because of how $\hat{L}_t$ is computed, the error $e_t := L_t - \hat{L}_t$ is correlated with $L_t$. As a result the tools developed in these earlier works cannot be used for our problem. This is the main reason we need to develop and analyze projection-PCA based approaches for both subspace addition and deletion.

In earlier conference papers [28], [29], we first introduced the ReProCS idea. However, these used an algorithm motivated by recursive PCA [30] for updating the subspace estimates on-the-fly. As explained in Sec III and also in [21, Appendix F], it is not clear how to obtain performance guarantees for recursive PCA (which is a fast algorithm for PCA) for our problem. Another online algorithm that addresses a problem similar to ours is given in [31]. This also does not obtain guarantees.

The ReProCS-cPCA approach is related to that of [32], [33], [34] in that all of these first try to nullify the low dimensional signal by projecting the measurement vector into a subspace perpendicular to that of the low dimensional signal, and then solve for the sparse "error" vector. However, the big difference is that in all of these works the basis for the subspace of the low dimensional signal is perfectly known. We study the case where the subspace is not known and can change over time.

C. Paper Organization

We give the notation next followed by a review of results from existing work that we will need. The problem definition and the three key assumptions that are needed are explained in Sec II. We develop the ReProCS-cPCA algorithm in Sec III. We give its performance guarantees (Theorem 4.1) in Sec IV. Here we also provide a discussion of the result and the assumptions it makes. We define the quantities needed for the proof and give the proof outline in Sec V. The proof of Theorem 4.1 is given in Sec VI. The key lemmas needed for it are given and proved in Sec VII. In Sec VIII, we show numerical experiments demonstrating Theorem 4.1 as well as comparisons with ReProCS and PCP. Conclusions are given in Sec IX.

D. Notation

For a set $T \subseteq \{1, 2, \dots, n\}$, we use $|T|$ to denote its cardinality, i.e., the number of elements in $T$. We use $T^c$ to denote its complement w.r.t. $\{1, 2, \dots, n\}$, i.e. $T^c := \{i \in \{1, 2, \dots, n\} : i \notin T\}$. The notations $T_1 \subseteq T_2$ and $T_2 \supseteq T_1$ both mean that $T_1$ is a subset of $T_2$.

We use the notation $[t_1, t_2]$ to denote an interval which contains $t_1$ and $t_2$, as well as all integers between them, i.e. $[t_1, t_2] := \{t_1, t_1 + 1, \dots, t_2\}$. The notation $[L_t; t \in [t_1, t_2]]$ is used to denote the matrix $[L_{t_1}, L_{t_1+1}, \dots, L_{t_2}]$.

For a vector $v$, $v_i$ denotes the $i$th entry of $v$ and $v_T$ denotes a vector consisting of the entries of $v$ indexed by $T$. We use $\|v\|_p$ to denote the $\ell_p$ norm of $v$. The support of $v$, $\text{supp}(v)$, is the set of indices at which $v$ is nonzero, $\text{supp}(v) := \{i : v_i \neq 0\}$. We say that $v$ is $s$-sparse if $|\text{supp}(v)| \le s$.

For a tall matrix P, span(P) denotes the subspace spanned by the column vectors of P. 

For a matrix $B$, $B'$ denotes its transpose, and $B^\dagger$ denotes its pseudo-inverse. For a matrix with linearly independent columns, $B^\dagger = (B'B)^{-1}B'$. We use $\|B\|_2 := \max_{x \neq 0} \|Bx\|_2 / \|x\|_2$ to denote the induced 2-norm of the matrix. Also, $\|B\|_*$ is the nuclear norm and $\|B\|_{\max}$ denotes the maximum over the absolute values of all its entries. We let $\sigma_i(B)$ denote the $i$th largest singular value of $B$. For a Hermitian matrix, $B$, we use the notation $B \overset{EVD}{=} U \Lambda U'$ to denote the eigenvalue decomposition (EVD) of $B$. Here $U$ is an orthonormal matrix and $\Lambda$ is a diagonal matrix with entries arranged in non-increasing order. Also, we use $\lambda_i(B)$ to denote the $i$th largest eigenvalue of a Hermitian matrix $B$ and we use $\lambda_{\max}(B)$ and $\lambda_{\min}(B)$ to denote its maximum and minimum eigenvalues. If $B$ is Hermitian positive semi-definite (p.s.d.), then $\lambda_i(B) = \sigma_i(B)$. For Hermitian matrices $B_1$ and $B_2$, the notation $B_1 \preceq B_2$ means that $B_2 - B_1$ is p.s.d. Similarly, $B_1 \succeq B_2$ means that $B_1 - B_2$ is p.s.d.

For a Hermitian matrix $B$, we have $\|B\|_2 = \sqrt{\max(\lambda_{\max}^2(B), \lambda_{\min}^2(B))}$. Thus, for a $b \ge 0$, $\|B\|_2 \le b$ implies that $-b \le \lambda_{\min}(B) \le \lambda_{\max}(B) \le b$. If $B$ is a Hermitian p.s.d. matrix, then $\|B\|_2 = \lambda_{\max}(B)$.

The notation $[.]$ denotes an empty matrix. We use $I$ to denote an identity matrix. For an $m \times n$ matrix $B$ and an index set $T \subseteq \{1, 2, \dots, n\}$, $B_T$ is the sub-matrix of $B$ containing columns with indices in the set $T$. Notice that $B_T = B I_T$. We use $B \setminus B_T$ to denote $B_{T^c}$ (here $T^c := \{i \in \{1, 2, \dots, n\} : i \notin T\}$). Given another matrix $B_2$ of size $m \times n_2$, $[B \ B_2]$ constructs a new matrix by concatenating matrices $B$ and $B_2$ in the horizontal direction. Thus, $[(B \setminus B_T) \ B_2] = [B_{T^c} \ B_2]$. For any matrix $B$ and sets $T_1, T_2$, $(B)_{T_1,T_2}$ denotes the sub-matrix containing the rows with indices in $T_1$ and columns with indices in $T_2$.

Definition 1.1: We refer to a tall matrix P as a basis matrix if it satisfies P'P = I. 

Definition 1.2: The $s$-restricted isometry constant (RIC) [32], $\delta_s$, for an $n \times m$ matrix $\Psi$ is the smallest real number satisfying $(1 - \delta_s)\|x\|_2^2 \le \|\Psi_T x\|_2^2 \le (1 + \delta_s)\|x\|_2^2$ for all sets $T \subseteq \{1, 2, \dots, m\}$ with $|T| \le s$ and all real vectors $x$ of length $|T|$.
It is easy to see that $\max_{T : |T| \le s} \|(\Psi_T' \Psi_T)^{-1}\|_2 \le \frac{1}{1 - \delta_s}$.
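For small matrices the RIC in Definition 1.2 can be evaluated by exhaustive search, which makes the definition concrete. The sketch below is illustrative only: the function name `restricted_isometry_constant` is hypothetical, and the enumeration over all column subsets is exponential in $s$, so it is not a practical method for large $m$.

```python
import itertools
import numpy as np

def restricted_isometry_constant(Psi, s):
    """Brute-force delta_s: the smallest delta such that
    (1 - delta)||x||^2 <= ||Psi_T x||^2 <= (1 + delta)||x||^2 for all |T| <= s."""
    m = Psi.shape[1]
    delta = 0.0
    for size in range(1, s + 1):
        for T in itertools.combinations(range(m), size):
            cols = list(T)
            gram = Psi[:, cols].T @ Psi[:, cols]
            eigs = np.linalg.eigvalsh(gram)  # ascending eigenvalues of Psi_T' Psi_T
            delta = max(delta, eigs[-1] - 1.0, 1.0 - eigs[0])
    return delta

# Two unit-norm columns with inner product 0.6: the Gram matrix has eigenvalues 0.4 and 1.6.
Psi_example = np.array([[1.0, 0.6],
                        [0.0, 0.8]])
delta_2 = restricted_isometry_constant(Psi_example, 2)

# The inverse bound stated after Definition 1.2, checked on the full column set.
inv_norm = np.linalg.norm(np.linalg.inv(Psi_example.T @ Psi_example), 2)
```

Here `delta_2` equals 0.6 and `inv_norm` equals 1/(1 - 0.6) = 2.5, so the bound on the inverse holds with equality in this small example.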

Definition 1.3: Let $X$ and $Z$ be two random variables (r.v.) and let $\mathcal{B}$ be a set of values that $Z$ can take.

1) We use $\mathcal{B}^e$ to denote the event $Z \in \mathcal{B}$, i.e. $\mathcal{B}^e := \{Z \in \mathcal{B}\}$.

2) The probability of event $\mathcal{B}^e$ can be expressed as [35],

$$P(\mathcal{B}^e) = E[\mathbb{I}_{\mathcal{B}}(Z)]$$

where

$$\mathbb{I}_{\mathcal{B}}(Z) := \begin{cases} 1 & \text{if } Z \in \mathcal{B} \\ 0 & \text{otherwise} \end{cases}$$

is an indicator function of $Z$ on the set $\mathcal{B}$ and $E[\mathbb{I}_{\mathcal{B}}(Z)]$ is the expectation of $\mathbb{I}_{\mathcal{B}}(Z)$.

3) Define $P(\mathcal{B}^e | X) := E[\mathbb{I}_{\mathcal{B}}(Z) | X]$ where $E[\mathbb{I}_{\mathcal{B}}(Z) | X]$ is the conditional expectation of $\mathbb{I}_{\mathcal{B}}(Z)$ given $X$.

Finally, RHS refers to the right hand side of an equation or inequality; w.p. means "with probability"; and w.h.p. means "with high probability".

E. Preliminaries

In this section we state certain results from literature, or certain lemmas which follow easily using these results, that will be used in proving our main result.

1) Simple probability facts and matrix Hoeffding inequalities: The following result follows directly from Definition 1.3.

Lemma 1.4: Suppose that $\mathcal{B}$ is the set of values that the r.v.s $X, Y$ can take. Suppose that $\mathcal{C}$ is a set of values that the r.v. $X$ can take. For a $0 \le p \le 1$, if $P(\mathcal{B}^e | X) \ge p$ for all $X \in \mathcal{C}$, then $P(\mathcal{B}^e | \mathcal{C}^e) \ge p$ as long as $P(\mathcal{C}^e) > 0$.

Proof: This is the same as [21, Lemma 11].

The following lemma is an easy consequence of the chain rule of probability applied to a contracting sequence of events.

Lemma 1.5: For a sequence of events $E_0^e, E_1^e, \dots, E_m^e$ that satisfy $E_0^e \supseteq E_1^e \supseteq E_2^e \supseteq \dots \supseteq E_m^e$, the following holds

$$P(E_m^e | E_0^e) = \prod_{k=1}^{m} P(E_k^e | E_{k-1}^e).$$

Proof: $P(E_m^e | E_0^e) = P(E_m^e, E_{m-1}^e, \dots, E_0^e | E_0^e) = \prod_{k=1}^{m} P(E_k^e | E_{k-1}^e, E_{k-2}^e, \dots, E_0^e) = \prod_{k=1}^{m} P(E_k^e | E_{k-1}^e)$. ■

The following two results are corollaries of the matrix Hoeffding inequality [26, Theorem 1.3] that were proved in [21]. In the rest of the paper we often refer to them as the Hoeffding corollaries.

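Corollary 1.6 (Matrix Hoeffding conditioned on another random variable for a nonzero mean Hermitian matrix): Given an $\alpha$-length sequence $\{Z_t\}$ of random Hermitian matrices of size $n \times n$, a r.v. $X$, and a set $\mathcal{C}$ of values that $X$ can take. Assume that, for all $X \in \mathcal{C}$, (i) $Z_t$'s are conditionally independent given $X$; (ii) $P(b_1 I \preceq Z_t \preceq b_2 I | X) = 1$ and (iii) $b_3 I \preceq \frac{1}{\alpha} \sum_t E(Z_t | X) \preceq b_4 I$. Then for all $\epsilon > 0$,

$$P\left(\lambda_{\max}\left(\frac{1}{\alpha} \sum_t Z_t\right) \le b_4 + \epsilon \,\Big|\, X\right) \ge 1 - n \exp\left(-\frac{\alpha \epsilon^2}{8 (b_2 - b_1)^2}\right) \text{ for all } X \in \mathcal{C}$$

$$P\left(\lambda_{\min}\left(\frac{1}{\alpha} \sum_t Z_t\right) \ge b_3 - \epsilon \,\Big|\, X\right) \ge 1 - n \exp\left(-\frac{\alpha \epsilon^2}{8 (b_2 - b_1)^2}\right) \text{ for all } X \in \mathcal{C}$$

Proof: This is a slight modification of [21, Corollary 13].

Corollary 1.7 (Matrix Hoeffding conditioned on another random variable for an arbitrary nonzero mean matrix): Given an $\alpha$-length sequence $\{Z_t\}$ of random matrices of size $n_1 \times n_2$, a r.v. $X$, and a set $\mathcal{C}$ of values that $X$ can take. Assume that, for all $X \in \mathcal{C}$, (i) $Z_t$'s are conditionally independent given $X$; (ii) $P(\|Z_t\|_2 \le b_1 | X) = 1$ and (iii) $\|\frac{1}{\alpha} \sum_t E(Z_t | X)\|_2 \le b_2$. Then, for all $\epsilon > 0$,

$$P\left(\left\|\frac{1}{\alpha} \sum_t Z_t\right\|_2 \le b_2 + \epsilon \,\Big|\, X\right) \ge 1 - (n_1 + n_2) \exp\left(-\frac{\alpha \epsilon^2}{32 b_1^2}\right) \text{ for all } X \in \mathcal{C}$$

Proof: This is a slight modification of [21, Corollary 14].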

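2) Linear algebra results: Kahan and Davis's sin $\theta$ theorem [24] studies the effect of a Hermitian perturbation, $\mathcal{H}$, on a Hermitian matrix, $\mathcal{A}$.

Theorem 1.8 (sin $\theta$ theorem [24]): Given two Hermitian matrices $\mathcal{A}$ and $\mathcal{H}$ satisfying

$$\mathcal{A} = \begin{bmatrix} E & E_\perp \end{bmatrix} \begin{bmatrix} A & 0 \\ 0 & A_\perp \end{bmatrix} \begin{bmatrix} E' \\ E_\perp' \end{bmatrix}, \quad \mathcal{H} = \begin{bmatrix} E & E_\perp \end{bmatrix} \begin{bmatrix} H & B' \\ B & H_\perp \end{bmatrix} \begin{bmatrix} E' \\ E_\perp' \end{bmatrix} \qquad (2)$$

where $[E \ E_\perp]$ is an orthonormal matrix. The two ways of representing $\mathcal{A} + \mathcal{H}$ are

$$\mathcal{A} + \mathcal{H} = \begin{bmatrix} E & E_\perp \end{bmatrix} \begin{bmatrix} A + H & B' \\ B & A_\perp + H_\perp \end{bmatrix} \begin{bmatrix} E' \\ E_\perp' \end{bmatrix} = \begin{bmatrix} F & F_\perp \end{bmatrix} \begin{bmatrix} \Lambda & 0 \\ 0 & \Lambda_\perp \end{bmatrix} \begin{bmatrix} F' \\ F_\perp' \end{bmatrix}$$

where $[F \ F_\perp]$ is another orthonormal matrix. Let $\mathcal{R} := (\mathcal{A} + \mathcal{H})E - \mathcal{A}E = \mathcal{H}E$. If $\lambda_{\min}(A) > \lambda_{\max}(\Lambda_\perp)$, then

$$\|(I - FF')E\|_2 \le \frac{\|\mathcal{R}\|_2}{\lambda_{\min}(A) - \lambda_{\max}(\Lambda_\perp)}$$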

Next we state Weyl's theorem (Weyl's inequality for matrices) [25, page 181] and Ostrowski's theorem [25, page 224].

Theorem 1.9 (Weyl [25]): Let $\mathcal{A}$ and $\mathcal{H}$ be two $n \times n$ Hermitian matrices. For each $i = 1, 2, \dots, n$ we have

$$\lambda_i(\mathcal{A}) + \lambda_{\min}(\mathcal{H}) \le \lambda_i(\mathcal{A} + \mathcal{H}) \le \lambda_i(\mathcal{A}) + \lambda_{\max}(\mathcal{H})$$

Theorem 1.10 (Ostrowski [25]): Let $H$ and $W$ be $n \times n$ matrices, with $H$ Hermitian and $W$ nonsingular. For each $i = 1, 2, \dots, n$, there exists a positive real number $\theta_i$ such that $\lambda_{\min}(WW') \le \theta_i \le \lambda_{\max}(WW')$ and $\lambda_i(WHW') = \theta_i \lambda_i(H)$. Therefore,

$$\lambda_{\min}(WHW') \ge \lambda_{\min}(WW') \lambda_{\min}(H)$$

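The following lemma uses the sin $\theta$ theorem and Weyl's theorem. It generalizes the idea of [21, Lemma 30].

Lemma 1.11: Suppose that two Hermitian matrices $\mathcal{A}$ and $\mathcal{H}$ can be decomposed as in (2) where $[E \ E_\perp]$ is an orthonormal matrix and $A$ is a $c \times c$ matrix. Also, suppose that the EVD of $\mathcal{A} + \mathcal{H}$ is

$$\mathcal{A} + \mathcal{H} \overset{EVD}{=} \begin{bmatrix} F & F_\perp \end{bmatrix} \begin{bmatrix} \Lambda & 0 \\ 0 & \Lambda_\perp \end{bmatrix} \begin{bmatrix} F' \\ F_\perp' \end{bmatrix}$$

where $\Lambda$ is a $c \times c$ diagonal matrix. If $\lambda_{\min}(A) > \lambda_{\max}(A_\perp) + \|\mathcal{H}\|_2$, then

$$\|(I - FF')E\|_2 \le \frac{\|\mathcal{H}\|_2}{\lambda_{\min}(A) - \lambda_{\max}(A_\perp) - \|\mathcal{H}\|_2}$$

Proof: By definition of EVD, $[F \ F_\perp]$ is an orthonormal matrix. By the sin $\theta$ theorem, if $\lambda_{\min}(A) > \lambda_{\max}(\Lambda_\perp)$, then $\|(I - FF')E\|_2 \le \frac{\|\mathcal{R}\|_2}{\lambda_{\min}(A) - \lambda_{\max}(\Lambda_\perp)}$ where $\mathcal{R} := \mathcal{H}E$. Clearly, $\|\mathcal{R}\|_2 \le \|\mathcal{H}\|_2$. Since $\lambda_{\min}(A) > \lambda_{\max}(A_\perp)$ and $A$ is a $c \times c$ matrix, $\lambda_{c+1}(\mathcal{A}) = \lambda_{\max}(A_\perp)$.

By definition of EVD (eigenvalues arranged in non-increasing order) and since $\Lambda$ is a $c \times c$ matrix, $\lambda_{c+1}(\mathcal{A} + \mathcal{H}) = \lambda_{\max}(\Lambda_\perp)$. By Weyl's theorem, $\lambda_{\max}(\Lambda_\perp) = \lambda_{c+1}(\mathcal{A} + \mathcal{H}) \le \lambda_{c+1}(\mathcal{A}) + \lambda_{\max}(\mathcal{H}) = \lambda_{\max}(A_\perp) + \lambda_{\max}(\mathcal{H})$. Since $\lambda_{\max}(\mathcal{H}) \le \|\mathcal{H}\|_2$, the result follows. ■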
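The bound of Lemma 1.11 can be checked numerically on a random instance. The sketch below is only a sanity check under illustrative assumptions (the sizes, the eigenvalue ranges and the perturbation scale are arbitrary choices); it is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
n, c = 30, 4

# Orthonormal [E E_perp] and the block-diagonal matrix A_cal = E A E' + E_perp A_perp E_perp'.
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
E, E_perp = U[:, :c], U[:, c:]
A = np.diag(rng.uniform(5.0, 6.0, size=c))           # lambda_min(A) >= 5
A_perp = np.diag(rng.uniform(0.0, 1.0, size=n - c))  # lambda_max(A_perp) <= 1
A_cal = E @ A @ E.T + E_perp @ A_perp @ E_perp.T

# Small Hermitian perturbation H_cal.
G = rng.standard_normal((n, n))
H_cal = 0.02 * (G + G.T)
H_norm = np.linalg.norm(H_cal, 2)

# EVD of the perturbed matrix; F holds the eigenvectors of the c largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(A_cal + H_cal)     # eigh returns ascending eigenvalues
F = eigvecs[:, -c:]

gap = np.min(np.diag(A)) - np.max(np.diag(A_perp)) - H_norm
lhs = np.linalg.norm((np.eye(n) - F @ F.T) @ E, 2)
rhs = H_norm / gap
print(f"||(I - FF')E||_2 = {lhs:.3e} <= {rhs:.3e}")
```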
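The following lemma is a minor modification of [21, Lemma 10].

Lemma 1.12: Suppose that $P$, $\hat{P}$ and $Q$ are three basis matrices, and $P$ and $\hat{P}$ are of the same size. Also, $Q'P = 0$ and $\|(I - \hat{P}\hat{P}')P\|_2 \le \zeta_*$. Then,

1) $\|(I - \hat{P}\hat{P}')PP'\|_2 = \|(I - PP')\hat{P}\hat{P}'\|_2 = \|(I - PP')\hat{P}\|_2 = \|(I - \hat{P}\hat{P}')P\|_2 \le \zeta_*$

2) $\|PP' - \hat{P}\hat{P}'\|_2 \le 2\|(I - \hat{P}\hat{P}')P\|_2 \le 2\zeta_*$

3) $\|\hat{P}'Q\|_2 \le \zeta_*$

4) $\sqrt{1 - \zeta_*^2} \le \sigma_i((I - \hat{P}\hat{P}')Q) \le 1$

Proof: The result follows exactly as in the proof of [21, Lemma 10]. ■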
3) Sparse Recovery Error Bound: The following is a minor modification of [22, Theorem 1] applied to exact sparse signals.

Theorem 1.13 ([22]): Suppose we observe $y := \Psi x + z$ where $z$ is the noise. Let $\hat{x}$ be the solution to the following problem

$$\min_x \|x\|_1 \quad \text{subject to} \quad \|y - \Psi x\|_2 \le \xi \qquad (3)$$

Assume that $x$ is $s$-sparse, $\|z\|_2 \le \xi$, and $\delta_{2s}(\Psi) \le b < (\sqrt{2} - 1)$. The solution of (3) obeys $\|\hat{x} - x\|_2 \le C_1 \xi$ with

$$C_1 := \frac{4\sqrt{1 + b}}{1 - (\sqrt{2} + 1) b}.$$

II. Problem Definition and Model Assumptions

We give the problem definition below followed by the model and three key assumptions.

A. Problem Definition

The measurement vector at time $t$, $M_t$, is an $n$ dimensional vector which can be decomposed as

$$M_t = L_t + S_t \qquad (4)$$

Here $S_t$ is a sparse vector with support set size at most $s$ and minimum magnitude of nonzero values at least $S_{\min}$. $L_t$ is a dense but low dimensional vector, i.e. $L_t = P_{(t)} a_t$ where $P_{(t)}$ is an $n \times r_{(t)}$ basis matrix with $r_{(t)} \ll n$, that changes every so often. $P_{(t)}$ and $a_t$ change according to the model given below. We are given an accurate estimate of the subspace in which the initial $t_{\text{train}}$ $L_t$'s lie, i.e. we are given a basis matrix $\hat{P}_0$ so that $\|(I - \hat{P}_0 \hat{P}_0')P_0\|_2$ is small. Here $P_0$ is a basis matrix for $\text{span}(\mathcal{L}_{t_{\text{train}}})$, i.e. $\text{span}(P_0) = \text{span}(\mathcal{L}_{t_{\text{train}}})$. Also, for the first $t_{\text{train}}$ time instants, $S_t$ is either zero or very small. The goal is

1) to estimate both $S_t$ and $L_t$ at each time $t > t_{\text{train}}$, and

2) to estimate $\text{span}(P_{(t)})$ every so often, i.e., update $\hat{P}_{(t)}$ so that the subspace estimation error, $\text{SE}_{(t)} := \|(I - \hat{P}_{(t)} \hat{P}_{(t)}')P_{(t)}\|_2$, is small.

[Figure 1: timeline of the subspace change times, with $P_{(t)} = P_0$ initially, then $P_{(t)} = P_1 = [(P_0 \setminus P_{1,\text{old}}) \ P_{1,\text{new}}]$, and in general $P_{(t)} = P_j = [(P_{j-1} \setminus P_{j,\text{old}}) \ P_{j,\text{new}}]$.]

Fig. 1. The subspace change model given in Sec II-A. Here $t_0 = 0$.



Notation for $S_t$. Let $T_t := \{i : (S_t)_i \neq 0\}$ denote the support of $S_t$. Define

$$S_{\min} := \min_{t > t_{\text{train}}} \min_{i \in T_t} |(S_t)_i| \quad \text{and} \quad s := \max_t |T_t|$$

Assumption 2.1 (Model on $L_t$): We assume that $L_t = P_{(t)} a_t$ where $P_{(t)}$ and $a_t$ satisfy the following.

1) $P_{(t)} = P_j$ for all $t_j \le t < t_{j+1}$, $j = 0, 1, 2, \dots, J$, where $P_j$ is an $n \times r_j$ basis matrix with $r_j \ll n$ and $r_j \ll (t_{j+1} - t_j)$. We let $t_0 = 0$ and $t_{J+1}$ equal the sequence length. This can be infinity also. At the change times, $t_j$, $P_j$ changes as $P_j = [(P_{j-1} \setminus P_{j,\text{old}}) \ P_{j,\text{new}}]$. Here, $P_{j,\text{new}}$ is an $n \times c_{j,\text{new}}$ basis matrix with $P_{j,\text{new}}' P_{j-1} = 0$ and $P_{j,\text{old}}$ contains $c_{j,\text{old}}$ columns of $P_{j-1}$. Thus $r_j = r_{j-1} + c_{j,\text{new}} - c_{j,\text{old}}$. Also, $0 < t_{\text{train}} \le t_1$. This model is illustrated in Fig. 1.

2) There exists a constant $c_{\max}$ such that $0 \le c_{j,\text{new}} \le c_{\max}$ and $\sum_{i=1}^{j} (c_{i,\text{new}} - c_{i,\text{old}}) \le c_{\max}$ for all $j$. Let $r_{\max} := r_0 + c_{\max}$. Thus, $r_j = r_0 + \sum_{i=1}^{j} (c_{i,\text{new}} - c_{i,\text{old}}) \le r_0 + c_{\max} = r_{\max}$, i.e., the rank of $P_j$ is upper bounded by $r_{\max}$.

3) $a_t := P_{(t)}' L_t$ is an $r_j$ length random variable (r.v.) with the following properties.

a) $a_t$'s are mutually independent over $t$.

b) $a_t$ is a zero mean bounded r.v., i.e. $E(a_t) = 0$ and there exists a constant $\gamma_*$ such that $\|a_t\|_\infty \le \gamma_*$ for all $t$.

c) Its covariance matrix $\Lambda_t := \text{Cov}[a_t] = E(a_t a_t')$ is diagonal with $\lambda^- := \min_t \lambda_{\min}(\Lambda_t) > 0$ and $\lambda^+ := \max_t \lambda_{\max}(\Lambda_t) < \infty$. Thus, the condition number of any $\Lambda_t$ is bounded by $f := \lambda^+ / \lambda^-$.

Also, $P_j$ and $a_t$ satisfy the assumptions discussed in the next three subsections.
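To make the subspace change model of Assumption 2.1 concrete, the following sketch generates one change $P_0 \to P_1$ and draws a vector $L_t = P_1 a_t$. It is an illustration under assumptions that are not in the text: the helper names `change_subspace` and `sample_L` are hypothetical, the new directions are drawn at random, and the entries of $a_t$ are taken to be independent uniform variables (one simple choice of a zero mean, bounded distribution with a diagonal covariance).

```python
import numpy as np

def change_subspace(P_prev, c_new, old_idx, rng):
    """Return (P_j, P_new) where P_j = [(P_prev minus the columns in old_idx)  P_new]
    and P_new is an n x c_new basis matrix with P_new' P_prev = 0."""
    n = P_prev.shape[0]
    X = rng.standard_normal((n, c_new))
    X -= P_prev @ (P_prev.T @ X)       # remove the component inside span(P_prev)
    P_new, _ = np.linalg.qr(X)         # orthonormalize the new directions
    removed = set(old_idx)
    keep = [i for i in range(P_prev.shape[1]) if i not in removed]
    return np.hstack([P_prev[:, keep], P_new]), P_new

def sample_L(P, lam, rng):
    """Draw L_t = P a_t with independent, zero mean, bounded a_t and Cov[a_t] = diag(lam)."""
    half_width = np.sqrt(3.0 * np.asarray(lam))  # uniform on [-w, w] has variance w^2 / 3
    a_t = rng.uniform(-half_width, half_width)
    return P @ a_t, a_t

rng = np.random.default_rng(0)
n, r0 = 100, 6
P0, _ = np.linalg.qr(rng.standard_normal((n, r0)))
P1, P1_new = change_subspace(P0, c_new=2, old_idx=[0], rng=rng)   # r_1 = 6 + 2 - 1 = 7
L_t, a_t = sample_L(P1, lam=np.linspace(4.0, 0.5, P1.shape[1]), rng=rng)
```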
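Definition 2.2: The following notation will be used frequently. Let $P_{j,*} := P_{(t_j - 1)} = P_{j-1}$. For $t \in [t_j, t_{j+1} - 1]$, let $a_{t,*} := P_{j,*}' L_t = P_{j-1}' L_t$ be the projection of $L_t$ along $P_{j,*}$, of which $a_{t,*,\text{nz}} := (P_{j-1} \setminus P_{j,\text{old}})' L_t$ is the nonzero part. Also, let $a_{t,\text{new}} := P_{j,\text{new}}' L_t$ be the projection of $L_t$ along the newly added directions. Thus,

$$a_{t,*} = \begin{bmatrix} a_{t,*,\text{nz}} \\ 0 \end{bmatrix} \quad \text{and} \quad a_t = \begin{bmatrix} a_{t,*,\text{nz}} \\ a_{t,\text{new}} \end{bmatrix}$$

where $0$ is a $c_{j,\text{old}}$ length zero vector (since $P_{j,\text{old}}' L_t = 0$). Using the above, for $t \in [t_j, t_{j+1} - 1]$, $L_t$ can be rewritten as

$$L_t = P_j a_t = (P_{j-1} \setminus P_{j,\text{old}}) a_{t,*,\text{nz}} + P_{j,\text{new}} a_{t,\text{new}} = P_{j,*} a_{t,*} + P_{j,\text{new}} a_{t,\text{new}}$$

and $\Lambda_t$ can be split as

$$\Lambda_t = \begin{bmatrix} (\Lambda_t)_{*,\text{nz}} & 0 \\ 0 & (\Lambda_t)_{\text{new}} \end{bmatrix}$$

where $(\Lambda_t)_{*,\text{nz}} := \text{Cov}(a_{t,*,\text{nz}})$ and $(\Lambda_t)_{\text{new}} := \text{Cov}(a_{t,\text{new}})$ are diagonal matrices.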



B. Slow subspace change

By slow subspace change we mean all of the following.

1) First, the delay between consecutive subspace change times, $t_{j+1} - t_j$, is large enough.

2) Second, the projection of $L_t$ along the newly added directions, $a_{t,\text{new}}$, is initially small, i.e. $\max_{t_j \le t < t_j + \alpha} \|a_{t,\text{new}}\|_\infty \le \gamma_{\text{new}}$, with $\gamma_{\text{new}} \ll \gamma_*$ and $\gamma_{\text{new}} \ll S_{\min}$, but can increase gradually. We model this as follows. Split the interval $[t_j, t_{j+1} - 1]$ into $\alpha$ length periods. We assume that

$$\max_j \max_{t \in [t_j + (k-1)\alpha, \, t_j + k\alpha - 1]} \|a_{t,\text{new}}\|_\infty \le \gamma_{\text{new},k} := \min(v^{k-1} \gamma_{\text{new}}, \gamma_*)$$

for a $v > 1$ but not too large$^1$. This assumption is verified for real video data in [21, Sec X-B].

3) Third, the number of newly added directions is small, i.e. $c_{j,\text{new}} \le c_{\max} \ll r_0$. This is also verified in [21, Sec X-B].

Fig. 2. We illustrate the clustering assumption. Assume $\Lambda_t = \Lambda_{\tilde{t}_j}$.
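The bound $\gamma_{\text{new},k} = \min(v^{k-1}\gamma_{\text{new}}, \gamma_*)$ in item 2 grows geometrically with the period index $k$ until it saturates at $\gamma_*$. A small sketch makes this explicit; the function name `gamma_new_k` and the numerical values are illustrative assumptions only.

```python
def gamma_new_k(k, gamma_new, v, gamma_star):
    """Bound on ||a_{t,new}||_inf during the k-th alpha-length period (k = 1, 2, ...)."""
    return min(v ** (k - 1) * gamma_new, gamma_star)

# Illustrative values: the bound starts small and grows by a factor v per period until gamma_star.
bounds = [gamma_new_k(k, gamma_new=0.5, v=1.2, gamma_star=10.0) for k in range(1, 21)]
```

With these values the bound starts at 0.5, is multiplied by 1.2 in each period, and is capped at 10.0 from the period in which $1.2^{k-1} \cdot 0.5$ first exceeds 10.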



C. Measuring denseness of a matrix and its relation with RIC

For a tall $n \times r$ matrix, $B$, or for an $n \times 1$ vector, $B$, we define the denseness coefficient as follows [21]:

$$\kappa_s(B) := \max_{|T| \le s} \frac{\|I_T' B\|_2}{\|B\|_2} \qquad (5)$$

where $\|.\|_2$ is the matrix or vector 2-norm respectively. Clearly, $\kappa_s(B) \le 1$. As explained in [21], $\kappa_s$ measures the denseness (non-compressibility) of a vector $B$ or of the columns of a matrix $B$. For a vector, a small value indicates that its entries are spread out, i.e. it is a dense vector. A large value indicates that it is compressible (approximately or exactly sparse). Similarly, for an $n \times r$ matrix $B$, a small $\kappa_s$ means that most (or all) of its columns are dense vectors.

For a basis matrix $P$, $\kappa_s(PP') = \kappa_s(P)$ and thus $\kappa_s(P)$ is a property of $\text{span}(P)$ [21].

Remark 2.3: A better way to quantify denseness of a matrix $B$ would be to define the denseness coefficient as $\max_{|T| \le s} \|I_T' Q(B)\|_2$ where $Q(B)$ is a basis matrix for $\text{span}(B)$, e.g. it can be obtained by QR decomposition on $B$. This definition will ensure that the denseness coefficient is a property of $\text{span}(B)$ for any matrix $B$. It is easy to see that $\|I_T' B\|_2 \le \|I_T' Q(B)\|_2 \|B\|_2$. Thus, even with this new definition, all our results, and all results of [21], will go through without any change. However, we keep the definition of (5) because it was used in [21] and the current work uses certain lemmas from [21].

The following lemma was proved in [21].

Lemma 2.4: For an $n \times r$ basis matrix $P$ (i.e. $P$ satisfying $P'P = I$),

$$\delta_s(I - PP') = \kappa_s^2(P).$$

In other words, if $P$ is dense enough (small $\kappa_s$), then the RIC of $I - PP'$ is small. As we explain in [21, Sec IV-D], $\kappa_s(B)$ is related to the denseness assumption required by PCP [2].
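Lemma 2.4 can be verified by brute force on a small example. The sketch below is illustrative: the helper names `subsets`, `denseness_coefficient` and `ric` are hypothetical, and the exhaustive enumeration over index sets is only feasible for small $n$ and $s$.

```python
import itertools
import numpy as np

def subsets(n, s):
    """All index sets T of {0, ..., n-1} with 1 <= |T| <= s, as lists."""
    for size in range(1, s + 1):
        for T in itertools.combinations(range(n), size):
            yield list(T)

def denseness_coefficient(B, s):
    """kappa_s(B) = max over |T| <= s of ||I_T' B||_2 / ||B||_2 (rows of B indexed by T)."""
    B = B.reshape(B.shape[0], -1)      # treat a vector as an n x 1 matrix
    num = max(np.linalg.norm(B[T, :], 2) for T in subsets(B.shape[0], s))
    return num / np.linalg.norm(B, 2)

def ric(Psi, s):
    """Brute-force s-restricted isometry constant of Psi (columns indexed by T)."""
    delta = 0.0
    for T in subsets(Psi.shape[1], s):
        eigs = np.linalg.eigvalsh(Psi[:, T].T @ Psi[:, T])
        delta = max(delta, eigs[-1] - 1.0, 1.0 - eigs[0])
    return delta

rng = np.random.default_rng(2)
n, r, s = 10, 2, 3
P, _ = np.linalg.qr(rng.standard_normal((n, r)))   # n x r basis matrix
kappa = denseness_coefficient(P, s)
delta = ric(np.eye(n) - P @ P.T, s)
print(f"kappa_s(P)^2 = {kappa**2:.6f}, delta_s(I - PP') = {delta:.6f}")
```

The two printed values agree up to floating point error, because $\Psi_T'\Psi_T = I - (I_T'P)(I_T'P)'$ for $\Psi = I - PP'$.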



D. Clustering assumption

For positive integers $K$ and $\alpha$, let $\tilde{t}_j := t_j + K\alpha$. We set their values in our main result, Theorem 4.1. Recall from the model on $L_t$ and the slow subspace change assumption that new directions, $P_{j,\text{new}}$, get added at $t = t_j$ and initially, for the first $\alpha$ frames, the projection of $L_t$ along these directions is small (and thus their variances are small), but can increase gradually. It is fair to assume that by $t = \tilde{t}_j$, the variances along these new directions have stabilized and do not change much for $t \in [\tilde{t}_j, t_{j+1} - 1]$. It is also fair to assume that the same is true for the variances along the existing directions, $P_{j-1}$. In other words, we assume that the matrix $\Lambda_t$ is either constant or does not change much during this period. Under this assumption, we assume that we can cluster its eigenvalues (diagonal entries) into a few clusters such that the distance between consecutive clusters is large and the distance between the smallest and largest element of each cluster is small. We make this precise below.

$^1$ Small $\gamma_{\text{new}}$ and slowly increasing $\gamma_{\text{new},k}$ is needed for the noise seen by the sparse recovery step to be small. However, if $\gamma_{\text{new}}$ is zero or very small, it will be impossible to estimate the new subspace. This will not happen in our model because $\gamma_{\text{new}} \ge \lambda^- > 0$.
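Assumption 2.5: Assume the following.

1) Either $\Lambda_t = \Lambda_{\tilde{t}_j}$ for all $t \in [\tilde{t}_j, t_{j+1} - 1]$ or $\Lambda_t$ changes very little during this period so that for each $i = 1, 2, \dots, r_j$, $\min_{t \in [\tilde{t}_j, t_{j+1} - 1]} \lambda_i(\Lambda_t) \ge \max_{t \in [\tilde{t}_j, t_{j+1} - 1]} \lambda_{i+1}(\Lambda_t)$.

2) Let $\mathcal{G}_{j,(1)}, \mathcal{G}_{j,(2)}, \dots, \mathcal{G}_{j,(\vartheta_j)}$ be a partition of the index set $\{1, 2, \dots, r_j\}$ so that $\min_{i \in \mathcal{G}_{j,(k)}} \min_{t \in [\tilde{t}_j, t_{j+1} - 1]} \lambda_i(\Lambda_t) > \max_{i \in \mathcal{G}_{j,(k+1)}} \max_{t \in [\tilde{t}_j, t_{j+1} - 1]} \lambda_i(\Lambda_t)$, i.e. the first group/cluster contains the largest set of eigenvalues, the second one the next smallest set and so on (see Fig. 2). Let

a) $G_{j,k} := (P_j)_{\mathcal{G}_{j,(k)}}$ be the corresponding cluster of eigenvectors, then $P_j = [G_{j,1}, G_{j,2}, \dots, G_{j,\vartheta_j}]$;

b) $\tilde{c}_{j,k} := |\mathcal{G}_{j,(k)}|$ be the number of elements in $\mathcal{G}_{j,(k)}$, then $\sum_{k=1}^{\vartheta_j} \tilde{c}_{j,k} = r_j$;

c) $\lambda_{j,k}^- := \min_{i \in \mathcal{G}_{j,(k)}} \min_{t \in [\tilde{t}_j, t_{j+1} - 1]} \lambda_i(\Lambda_t)$, $\lambda_{j,k}^+ := \max_{i \in \mathcal{G}_{j,(k)}} \max_{t \in [\tilde{t}_j, t_{j+1} - 1]} \lambda_i(\Lambda_t)$ and $\lambda_{j,\vartheta_j + 1}^+ := 0$;

d) $\tilde{g}_{j,k} := \lambda_{j,k}^+ / \lambda_{j,k}^-$ (notice that $\tilde{g}_{j,k} \ge 1$);

e) $\tilde{h}_{j,k} := \lambda_{j,k+1}^+ / \lambda_{j,k}^-$ (notice that $\tilde{h}_{j,k} < 1$);

f) $\tilde{g}_{\max} := \max_j \max_{k = 1, 2, \dots, \vartheta_j} \tilde{g}_{j,k}$, $\tilde{h}_{\max} := \max_j \max_{k = 1, 2, \dots, \vartheta_j} \tilde{h}_{j,k}$, $\tilde{c}_{\min} := \min_j \min_{k = 1, 2, \dots, \vartheta_j} \tilde{c}_{j,k}$;

g) $\vartheta_{\max} := \max_j \vartheta_j$.

We assume that $\tilde{g}_{\max}$ is small enough (the distance between the smallest and largest eigenvalues of a cluster is small) and $\tilde{h}_{\max}$ is small enough (the distance between consecutive clusters is large). We quantify this in Theorem 4.1.

Remark 2.6: The assumption above can, in fact, be relaxed to only require the following. The matrices $\Lambda_t$ are such that there exists a partition, $\mathcal{G}_{j,(1)}, \mathcal{G}_{j,(2)}, \dots, \mathcal{G}_{j,(\vartheta_j)}$, of the index set $\{1, 2, \dots, r_j\}$ so that $\min_{i \in \mathcal{G}_{j,(k)}} \min_{t \in [\tilde{t}_j, t_{j+1} - 1]} \lambda_i(\Lambda_t) > \max_{i \in \mathcal{G}_{j,(k+1)}} \max_{t \in [\tilde{t}_j, t_{j+1} - 1]} \lambda_i(\Lambda_t)$. Define all quantities as above. We assume that $\tilde{g}_{\max}$ and $\tilde{h}_{\max}$ are small enough.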
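The within-cluster ratios $\tilde{g}_{j,k}$ and between-cluster ratios $\tilde{h}_{j,k}$ of Assumption 2.5 are easy to compute once the eigenvalues and the cluster sizes are fixed. The sketch below assumes the simplest case of item 1, a covariance matrix that is constant over the interval, so the inner min and max over $t$ disappear; the function name `cluster_ratios` and the example eigenvalues are illustrative assumptions.

```python
import numpy as np

def cluster_ratios(eigs, cluster_sizes):
    """Return (g, h): g[k] = lambda_k^+ / lambda_k^- and h[k] = lambda_{k+1}^+ / lambda_k^-,
    for consecutive clusters of the eigenvalues sorted in non-increasing order."""
    eigs = np.sort(np.asarray(eigs, dtype=float))[::-1]
    if sum(cluster_sizes) != len(eigs):
        raise ValueError("cluster sizes must sum to the number of eigenvalues")
    bounds = np.cumsum([0] + list(cluster_sizes))
    num_clusters = len(cluster_sizes)
    lam_plus = [eigs[bounds[k]] for k in range(num_clusters)]           # largest in cluster k
    lam_minus = [eigs[bounds[k + 1] - 1] for k in range(num_clusters)]  # smallest in cluster k
    lam_plus.append(0.0)  # convention: lambda^+ of the cluster after the last one is 0
    g = [lam_plus[k] / lam_minus[k] for k in range(num_clusters)]
    h = [lam_plus[k + 1] / lam_minus[k] for k in range(num_clusters)]
    return g, h

# Two well separated clusters {10, 9} and {1, 0.9}.
g, h = cluster_ratios([10.0, 9.0, 1.0, 0.9], cluster_sizes=[2, 2])
g_max, h_max = max(g), max(h)
```

For this example every $\tilde{g}$ is close to 1 and $\tilde{h}_{\max} = 1/9$, which is the regime the assumption asks for.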

III. ReProCS with cluster-PCA (ReProCS-cPCA) 

We first briefly recap the main idea of projection-PCA (proj-PCA) which was used in [21]. The ReProCS with cluster-PCA (ReProCS-cPCA) algorithm is then explained. In Sec III-C, we discuss how to set its parameters in practice when the model may not be known. The need for proj-PCA is explained in Sec III-D. We need the following notation.

Definition 3.1: Let $\tilde{t}_j := t_j + K\alpha$. Define the following time intervals

1) $\mathcal{I}_{j,k} := [t_j + (k-1)\alpha,\ t_j + k\alpha - 1]$ for $k = 1, 2, \dots, K$.

2) $\tilde{\mathcal{I}}_{j,k} := [\tilde{t}_j + (k-1)\tilde{\alpha},\ \tilde{t}_j + k\tilde{\alpha} - 1]$ for $k = 1, 2, \dots, \vartheta_j$.

3) $\tilde{\mathcal{I}}_{j,\vartheta_j+1} := [\tilde{t}_j + \vartheta_j\tilde{\alpha},\ t_{j+1} - 1]$.

Notice that $[t_j, t_{j+1}-1] = (\cup_{k=1}^{K} \mathcal{I}_{j,k}) \cup (\cup_{k=1}^{\vartheta_j} \tilde{\mathcal{I}}_{j,k}) \cup \tilde{\mathcal{I}}_{j,\vartheta_j+1}$. Also, $K$, $\alpha$ and $\tilde{\alpha}$ are parameters given in Algorithm 2.
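As a quick check of how these intervals tile $[t_j, t_{j+1}-1]$, the sketch below lists them as inclusive (start, end) pairs. The function name `reprocs_intervals` and the numerical values are hypothetical; only the interval formulas are taken from Definition 3.1.

```python
def reprocs_intervals(t_j, t_next, K, alpha, alpha_tilde, theta_j):
    """Return the addition intervals, the cluster-PCA intervals and the remaining interval.

    All intervals are inclusive (start, end) pairs; t_next plays the role of t_{j+1}.
    """
    t_tilde = t_j + K * alpha  # time at which the addition stage ends
    addition = [(t_j + (k - 1) * alpha, t_j + k * alpha - 1) for k in range(1, K + 1)]
    deletion = [(t_tilde + (k - 1) * alpha_tilde, t_tilde + k * alpha_tilde - 1)
                for k in range(1, theta_j + 1)]
    remaining = (t_tilde + theta_j * alpha_tilde, t_next - 1)
    return addition, deletion, remaining

# Hypothetical values: K = 3 addition steps of length 10, 2 clusters of length 20.
addition, deletion, remaining = reprocs_intervals(100, 200, K=3, alpha=10,
                                                  alpha_tilde=20, theta_j=2)
```

The last interval is non-empty only when $t_{j+1} - t_j > K\alpha + \vartheta_j\tilde{\alpha}$, which is the delay requirement used later in the analysis.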

A. The Projection-PCA algorithm 

Given a data matrix $\mathcal{D}$, a basis matrix $P$ and an integer $r$, projection-PCA (proj-PCA) applies PCA on $\mathcal{D}_{\text{proj}} := (I - PP')\mathcal{D}$, i.e., it computes the top $r$ eigenvectors (the eigenvectors with the largest $r$ eigenvalues) of $\frac{1}{\alpha_{\mathcal{D}}} \mathcal{D}_{\text{proj}} \mathcal{D}_{\text{proj}}'$. Here $\alpha_{\mathcal{D}}$ is the number of column vectors in $\mathcal{D}$. This is summarized in Algorithm 1.

If $P = [.]$, then projection-PCA reduces to standard PCA, i.e. it computes the top $r$ eigenvectors of $\frac{1}{\alpha_{\mathcal{D}}} \mathcal{D}\mathcal{D}'$.

We should mention that the idea of projecting perpendicular to a partly estimated subspace has been used in different contexts in past work [36], [8].

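Algorithm 1 projection-PCA: $Q \leftarrow \text{proj-PCA}(\mathcal{D}, P, r)$

1) Projection: compute $\mathcal{D}_{\text{proj}} \leftarrow (I - PP')\mathcal{D}$

2) PCA: compute the EVD

$$\frac{1}{\alpha_{\mathcal{D}}} \mathcal{D}_{\text{proj}} \mathcal{D}_{\text{proj}}' \overset{EVD}{=} \begin{bmatrix} Q & Q_\perp \end{bmatrix} \begin{bmatrix} \Lambda & 0 \\ 0 & \Lambda_\perp \end{bmatrix} \begin{bmatrix} Q' \\ Q_\perp' \end{bmatrix}$$

where $Q$ is an $n \times r$ basis matrix and $\alpha_{\mathcal{D}}$ is the number of columns in $\mathcal{D}$.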
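The two steps of Algorithm 1 can be sketched in a few lines of NumPy. This is only an illustrative implementation under the assumption that $P$ has orthonormal columns; the function name `proj_pca` and the toy data are ours, and a full eigendecomposition is used for clarity rather than efficiency.

```python
import numpy as np

def proj_pca(D, P, r):
    """Top-r eigenvectors of (1/alpha_D) * D_proj @ D_proj.T, with D_proj = (I - P P') D.

    D: n x alpha_D data matrix; P: n x p matrix with orthonormal columns (or None / empty
    for standard PCA); r: number of directions to return.
    """
    D = np.asarray(D, dtype=float)
    alpha_D = D.shape[1]
    if P is not None and np.size(P) > 0:
        D_proj = D - P @ (P.T @ D)      # step 1: project perpendicular to span(P)
    else:
        D_proj = D                      # P = [.] reduces to standard PCA
    cov = (D_proj @ D_proj.T) / alpha_D
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:r]      # step 2: indices of the r largest eigenvalues
    return eigvecs[:, top]

# Toy example: data in span(e1, e2) with most energy along e1; e1 is already known.
rng = np.random.default_rng(0)
coeffs = rng.standard_normal((2, 200)) * np.array([[10.0], [1.0]])
D = np.eye(4)[:, :2] @ coeffs
P = np.eye(4)[:, :1]
Q = proj_pca(D, P, 1)   # recovers the remaining direction e2 (up to sign)
```

Because the strong direction is projected out first, the weaker direction is found even though it carries far less energy than the one already in $P$.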
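Algorithm 2 Recursive Projected CS with cluster-PCA (ReProCS-cPCA)

Parameters: algorithm parameters: $\xi$, $\omega$, $\alpha$, $\tilde{\alpha}$, $K$; model parameters: $t_j$, $r_0$, $c_{j,\text{new}}$, $\vartheta_j$ and $\tilde{c}_{j,i}$

Input: $n \times 1$ vector, $M_t$, and $n \times r_0$ basis matrix $\hat{P}_0$. Output: $n \times 1$ vectors $\hat{S}_t$ and $\hat{L}_t$, and $n \times r_{(t)}$ basis matrix $\hat{P}_{(t)}$.

Initialization: Let $\hat{P}_{(t_{\text{train}})} \leftarrow \hat{P}_0$. Let $j \leftarrow 1$, $k \leftarrow 1$. For $t > t_{\text{train}}$, do the following:

1) Estimate $T_t$ and $S_t$ via Projected CS:

   a) Nullify most of $L_t$: compute $\Phi_{(t)} \leftarrow I - \hat{P}_{(t-1)}\hat{P}_{(t-1)}'$, $y_t \leftarrow \Phi_{(t)} M_t$

   b) Sparse Recovery: compute $\hat{S}_{t,\text{cs}}$ as the solution of $\min_x \|x\|_1$ s.t. $\|y_t - \Phi_{(t)} x\|_2 \le \xi$

   c) Support Estimate: compute $\hat{T}_t = \{i : |(\hat{S}_{t,\text{cs}})_i| > \omega\}$

   d) LS Estimate of $S_t$: compute $(\hat{S}_t)_{\hat{T}_t} = ((\Phi_{(t)})_{\hat{T}_t})^{\dagger} y_t$, $(\hat{S}_t)_{\hat{T}_t^c} = 0$

2) Estimate $L_t$: $\hat{L}_t = M_t - \hat{S}_t$.

3) Update $\hat{P}_{(t)}$:

   a) If $t \ne t_j + q\alpha - 1$ for any $q = 1, 2, \dots, K$ and $t \ne t_j + K\alpha + \vartheta_j\tilde{\alpha} - 1$,
      i) set $\hat{P}_{(t)} \leftarrow \hat{P}_{(t-1)}$

   b) Addition: Estimate span($P_{j,\text{new}}$) iteratively using proj-PCA: If $t = t_j + k\alpha - 1$
      i) $\hat{P}_{j,\text{new},k} \leftarrow \text{proj-PCA}([\hat{L}_t;\ t \in \mathcal{I}_{j,k}],\ \hat{P}_{j-1},\ c_{j,\text{new}})$
      ii) set $\hat{P}_{(t)} \leftarrow [\hat{P}_{j-1}\ \hat{P}_{j,\text{new},k}]$.
      iii) If $k = K$, reset $k \leftarrow 1$; else increment $k \leftarrow k + 1$.

   c) Deletion: Estimate span($P_j$) by cluster-PCA: If $t = t_j + K\alpha + \vartheta_j\tilde{\alpha} - 1$,
      i) For $i = 1, 2, \dots, \vartheta_j$,
         $\hat{G}_{j,i} \leftarrow \text{proj-PCA}([\hat{L}_t;\ t \in \tilde{\mathcal{I}}_{j,i}],\ [\hat{G}_{j,1}, \hat{G}_{j,2}, \dots, \hat{G}_{j,i-1}],\ \tilde{c}_{j,i})$
         End for
      ii) set $\hat{P}_j \leftarrow [\hat{G}_{j,1}, \dots, \hat{G}_{j,\vartheta_j}]$ and set $\hat{P}_{(t)} \leftarrow \hat{P}_j$.
      iii) increment $j \leftarrow j + 1$.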



B. The ReProCS-cPCA algorithm 

ReProCS-cPCA is summarized in Algorithm 2. It proceeds as follows. The algorithm begins with the knowledge of $\hat{P}_0$ and initializes $\hat{P}_{(t_{\text{train}})} \leftarrow \hat{P}_0$. $\hat{P}_0$ can be computed as the top $r_0$ left singular vectors of $\mathcal{M}_{t_{\text{train}}}$ (since, by assumption, $\mathcal{S}_{t_{\text{train}}}$ is either zero or very small). For $t > t_{\text{train}}$, the following is done. Step 1 projects $M_t$ perpendicular to $\hat{P}_{(t-1)}$, solves the $\ell_1$ minimization problem, performs support recovery, and finally computes a least squares (LS) estimate of $S_t$ on its estimated support. This final estimate $\hat{S}_t$ is used to estimate $L_t$ as $\hat{L}_t = M_t - \hat{S}_t$ in step 2. The sparse recovery error is $e_t := \hat{S}_t - S_t$. Since $\hat{L}_t = M_t - \hat{S}_t$, $e_t$ also satisfies $e_t = L_t - \hat{L}_t$. Thus, a small $e_t$ (accurate recovery of $S_t$) means that $L_t$ is also recovered accurately. Step 3a is used at times when no subspace update is done. In step 3b, the estimated $\hat{L}_t$'s are used to obtain improved estimates of span($P_{j,\text{new}}$) every $\alpha$ frames for a total of $K\alpha$ frames using the proj-PCA procedure given in Algorithm 1. As explained in [21], within $K$ proj-PCA updates ($K$ chosen as given in Theorem 4.1), it can be shown that both $\|e_t\|_2$ and the subspace error, $\text{SE}_{(t)} := \|(I - \hat{P}_{(t)}\hat{P}_{(t)}')P_{(t)}\|_2$, drop down to a constant times $\zeta$. In particular, if at $t = t_j - 1$, $\text{SE}_{(t)} \le r\zeta$, then at $t = \tilde{t}_j := t_j + K\alpha$, we can show that $\text{SE}_{(t)} \le (r + c_{\max})\zeta$. Here $r := r_{\max} = r_0 + c_{\max}$.
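The non-iterative parts of steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the full algorithm: the $\ell_1$ minimization of step 1b is not implemented, and its output is instead passed in as the argument `S_cs`. The function name `projected_ls_step` and the toy data are assumptions made for this example.

```python
import numpy as np

def projected_ls_step(M_t, P_hat_prev, S_cs, omega):
    """Steps 1a, 1c, 1d and 2 of the algorithm, given the output S_cs of step 1b."""
    n = M_t.shape[0]
    Phi = np.eye(n) - P_hat_prev @ P_hat_prev.T       # 1a: projector that nullifies most of L_t
    y = Phi @ M_t
    T_hat = np.flatnonzero(np.abs(S_cs) > omega)      # 1c: support estimate by thresholding
    S_hat = np.zeros(n)
    if T_hat.size > 0:                                # 1d: least squares on the estimated support
        S_hat[T_hat] = np.linalg.lstsq(Phi[:, T_hat], y, rcond=None)[0]
    L_hat = M_t - S_hat                               # 2: estimate of L_t
    return T_hat, S_hat, L_hat

# Toy example: a dense one-dimensional subspace that is known exactly, one sparse entry.
n = 5
P = np.ones((n, 1)) / np.sqrt(n)
L_t = 3.0 * P[:, 0]
S_t = np.array([0.0, 0.0, 4.0, 0.0, 0.0])
S_cs = S_t + 0.01                                     # stand-in for a slightly noisy CS output
T_hat, S_hat, L_hat = projected_ls_step(L_t + S_t, P, S_cs, omega=1.0)
```

When the subspace estimate is exact, as in this toy case, the projection removes $L_t$ completely and the LS step recovers $S_t$ on its support without error, so $\hat{L}_t = L_t$ as well.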

To bring $\text{SE}_{(t)}$ down to $r\zeta$ before $t_{j+1}$, we need a step so that by $t = t_{j+1} - 1$ we have an estimate of only span($P_j$), i.e. we have "deleted" span($P_{j,\text{old}}$). One simple way to do this is by standard PCA: at $t = \tilde{t}_j + \tilde{\alpha} - 1$, compute $\hat{P}_j \leftarrow \text{proj-PCA}([\hat{L}_t;\ t \in \tilde{\mathcal{I}}_{j,1}],\ [.],\ r_j)$ and let $\hat{P}_{(t)} \leftarrow \hat{P}_j$. Using the $\sin\theta$ theorem and the Hoeffding corollaries, it can be shown that, as long as $f$ is small enough, doing this is guaranteed to give an accurate estimate of span($P_j$). However $f$ being small is not compatible with the slow subspace change assumption. Notice from Sec II that $\lambda^- \le \gamma_{\text{new}}$ and $E[\|L_t\|_2^2] \le r\lambda^+$. Slow subspace change implies that $\gamma_{\text{new}}$ is small. Thus, $\lambda^-$ is small. However, to allow $L_t$ to have large magnitude, $\lambda^+$ needs to be large. Thus, $f = \lambda^+/\lambda^-$ cannot be small unless we require that $L_t$ has small magnitude for all times $t$.
[Figure 3: timeline over $[t_j, t_{j+1}-1]$, during which $P_{(t)} = P_j = [(P_{j-1} \setminus P_{j,\text{old}}),\ P_{j,\text{new}}]$. It marks the subspace change time $t_j$, the first and second proj-PCA times, the addition stage up to $\tilde{t}_j = t_j + K\alpha$ (estimate $P_{j,\text{new}}$ by $K$ times projection-PCA), and the deletion stage that follows (estimate $P_j$ by cluster-PCA).]

Fig. 3. A diagram illustrating subspace estimation by ReProCS-cPCA


In step 3c, we introduce a generalization of the above strategy called cluster-PCA, that removes the bound on $f$, but instead only requires that the eigenvalues of Cov($L_t$) be sufficiently clustered as explained in Sec II-D. The main idea is to recover one cluster of entries of $P_j$ at a time. In the $k$-th iteration, we apply proj-PCA on $[\hat{L}_t;\ t \in \tilde{\mathcal{I}}_{j,k}]$ with $P \leftarrow [\hat{G}_{j,1}, \hat{G}_{j,2}, \dots, \hat{G}_{j,k-1}]$ to estimate span($G_{j,k}$). The first iteration uses $P \leftarrow [.]$, i.e. it computes standard PCA to estimate span($G_{j,1}$). By modifying the approach used in [21] for analyzing the addition step, we can show that since $\tilde{g}_{j,k}$ and $\tilde{h}_{j,k}$ are small enough (by Assumption 2.5), span($G_{j,k}$) will be accurately recovered, i.e. $\|(I - \sum_{i=1}^{k} \hat{G}_{j,i}\hat{G}_{j,i}')G_{j,k}\|_2 \le \tilde{c}_{j,k}\zeta$. We do this $\vartheta_j$ times and finally we set $\hat{P}_j \leftarrow [\hat{G}_{j,1}, \hat{G}_{j,2}, \dots, \hat{G}_{j,\vartheta_j}]$ and $\hat{P}_{(t)} \leftarrow \hat{P}_j$. All of this is done at $t = \tilde{t}_j + \vartheta_j\tilde{\alpha} - 1$. Thus, at this time, $\text{SE}_{(t)} = \|(I - \hat{P}_j\hat{P}_j')P_j\|_2 \le \sum_{k=1}^{\vartheta_j} \|(I - \sum_{i=1}^{k} \hat{G}_{j,i}\hat{G}_{j,i}')G_{j,k}\|_2 \le \sum_{k=1}^{\vartheta_j} \tilde{c}_{j,k}\zeta = r_j\zeta \le r\zeta$. Under the assumption that $t_{j+1} - t_j \ge K\alpha + \vartheta_{\max}\tilde{\alpha}$, this means that before the next subspace change time, $t_{j+1}$, $\text{SE}_{(t)}$ is below $r\zeta$.
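The cluster-by-cluster recovery described above can be sketched as a loop over proj-PCA calls. The sketch below is illustrative only: it uses noise-free data that lies exactly in span($P_j$) (so $\hat{L}_t = L_t$), and the names `cluster_pca` and `subspace_error`, the dimensions and the standard deviations are assumptions made for the example.

```python
import numpy as np

def proj_pca(D, P, r):
    """Top-r eigenvectors of the sample covariance of (I - P P') D."""
    D_proj = D - P @ (P.T @ D) if P.shape[1] > 0 else D
    eigvals, eigvecs = np.linalg.eigh((D_proj @ D_proj.T) / D.shape[1])
    return eigvecs[:, np.argsort(eigvals)[::-1][:r]]

def cluster_pca(blocks, sizes):
    """Recover one cluster of directions per data block, projecting out earlier clusters."""
    n = blocks[0].shape[0]
    G_hat = np.zeros((n, 0))                   # first iteration uses P = [.]
    for block, c_k in zip(blocks, sizes):
        G_k = proj_pca(block, G_hat, c_k)      # k-th iteration of cluster-PCA
        G_hat = np.hstack([G_hat, G_k])
    return G_hat

def subspace_error(P_hat, P):
    """SE = ||(I - P_hat P_hat') P||_2 (spectral norm)."""
    return np.linalg.norm(P - P_hat @ (P_hat.T @ P), 2)

rng = np.random.default_rng(1)
n = 6
P = np.linalg.qr(rng.standard_normal((n, 3)))[0]   # true basis, 3 directions
std = np.array([10.0, 9.0, 1.0])                   # two clusters: {10, 9} and {1}
blocks = [P @ (std[:, None] * rng.standard_normal((3, 300))) for _ in range(2)]
P_hat = cluster_pca(blocks, sizes=[2, 1])
err = subspace_error(P_hat, P)
```

In this idealized setting the subspace is recovered essentially exactly; the analysis in the paper is needed precisely because the actual $\hat{L}_t$'s contain the error $e_t$, which is correlated with $L_t$.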

We illustrate the ideas of subspace estimation by addition proj-PCA and cluster-PCA in Fig. 3. We discuss the connection between proj-PCA done in the addition step and the cluster-PCA (for deletion) step in Table I given in Sec V-C.

C. Practical Parameter Settings 

The ReProCS-cPCA algorithm has parameters $\xi$, $\omega$, $\alpha$, $\tilde{\alpha}$, $K$ and it uses knowledge of model parameters $t_j$, $r_0$, $c_{j,\text{new}}$, $\vartheta_j$ and $\tilde{c}_{j,i}$. If the model is known, the algorithm parameters can be set as in Theorem 4.1. In practice, typically the model is unknown. In this case, the parameters $t_j$, $r_0$, $c_{j,\text{new}}$, $\omega$, $K$ can be set as explained in [21]. The parameters $\vartheta_j$ and $\tilde{c}_{j,i}$ for $i = 1, 2, \dots, \vartheta_j$ can be set by computing the eigenvalues of $\frac{1}{\tilde{\alpha}} \sum_{t \in \tilde{\mathcal{I}}_{j,1}} \hat{L}_t\hat{L}_t'$ and clustering them using any standard clustering algorithm, e.g. k-means clustering or split-and-merge (see footnote 2). We pick $\alpha$ and $\tilde{\alpha}$ somewhat arbitrarily. A rule of thumb is that $\alpha$ and $\tilde{\alpha}$ need to be at least five to ten times $c_{\max}$ and $\max_j \max_{i=1,2,\dots,\vartheta_j} \tilde{c}_{j,i}$ respectively. From simulation experiments, the algorithm is not very sensitive to the specific choice.
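As a rough illustration of how cluster sizes might be read off a set of estimated eigenvalues, the sketch below cuts the sorted eigenvalues at the largest ratios between consecutive values. This largest-gap heuristic is a simple stand-in chosen for brevity; it is neither the k-means nor the split-and-merge procedure mentioned above, and the function name `split_by_largest_gaps` and the example values are hypothetical.

```python
import numpy as np

def split_by_largest_gaps(eigvals, num_clusters):
    """Return cluster sizes obtained by cutting at the (num_clusters - 1) largest ratios.

    Assumes strictly positive eigenvalues, so that consecutive ratios are well defined.
    """
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    ratios = lam[:-1] / lam[1:]                              # each ratio is >= 1
    cuts = np.sort(np.argsort(ratios)[::-1][:num_clusters - 1]) + 1
    edges = np.concatenate(([0], cuts, [lam.size]))
    return [int(b - a) for a, b in zip(edges[:-1], edges[1:])]

sizes = split_by_largest_gaps([100.0, 90.0, 10.0, 9.0, 1.0], num_clusters=3)
```

The returned sizes play the role of $\tilde{c}_{j,i}$ and their count the role of $\vartheta_j$; in practice the number of clusters would itself have to be chosen, for example by inspecting the ratios.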

D. The need for Projection-PCA

Standard PCA cannot be used, and proj-PCA is needed, because $e_t = L_t - \hat{L}_t$ is correlated with $L_t$. The discussion here also applies to recursive or online PCA, which is just a fast algorithm for computing standard PCA. In most existing works that analyze finite sample PCA, e.g. see [27] and references therein, the noise or error in the "data" used for PCA (here the $\hat{L}_t$'s) is uncorrelated with the true values of the data (here the $L_t$'s) and is zero mean. Thus, when computing the eigenvectors of $(1/\alpha)\sum_t \hat{L}_t\hat{L}_t'$, the dominant term of the perturbation, $(1/\alpha)\sum_t \hat{L}_t\hat{L}_t' - (1/\alpha)\sum_t L_tL_t'$, is $(1/\alpha)\sum_t e_te_t'$ (the terms $(1/\alpha)\sum_t L_te_t'$ and its transpose are close to zero w.h.p. due to the law of large numbers). By assuming that the error/noise $e_t$ is small enough, the perturbation can be made small enough.

However, for our problem, because $e_t$ and $L_t$ are correlated, the dominant terms in the perturbation seen by standard PCA will be $(1/\alpha)\sum_t L_te_t'$ and its transpose. Since $L_t$ can have large magnitude, the bound on the perturbation will be large and this will create problems when applying the $\sin\theta$ theorem (Theorem 1.8) to bound the subspace error. On the other hand, when using proj-PCA, $L_t$ gets replaced by $(I - \hat{P}_{j-1}\hat{P}_{j-1}')L_t$ (in the addition step) or by $(I - \sum_{i=1}^{k-1} \hat{G}_i\hat{G}_i')L_t$ (in cluster-PCA) and this results in a significantly smaller perturbation. We have explained this point in detail in Appendix F of [21].

Footnote 2: One simple split-and-merge approach is as follows. Start with a single cluster. Split into two clusters: select the split so that $\tilde{g}_{\max}$ is minimized. Split each of these clusters into two parts again while ensuring $\tilde{g}_{\max}$ is minimized. Keep doing this for $d_1$ steps. Notice that, with every splitting, $\tilde{g}_{\max}$ will either remain the same or reduce, however $\tilde{h}_{\max}$ will either remain the same or increase. Then, do a set of merge steps: in each step find the pair of consecutive clusters to merge that will minimize $\tilde{h}_{\max}$.

IV. Performance Guarantees 

We state the main result first and then discuss it in the next subsection. We give its corollary for the case where $f$ is small in Sec IV-C. The proof outline is given in Sec V and the proof is given in Sec VI.

A. Main Result 

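Theorem 4.1: Consider Algorithm 2. Let $c := c_{\max}$ and $r := r_0 + c$. Assume that $L_t$ obeys the model given in Assumption 2.1. Also, assume that the initial subspace estimate is accurate enough, i.e. $\|(I - \hat{P}_0\hat{P}_0')P_0\|_2 \le r_0\zeta$ for a $\zeta$ that satisfies

$$\zeta \le \min\left( \frac{10^{-4}}{(r+c)^2},\ \frac{1.5 \times 10^{-4}}{(r+c)^2 f},\ \frac{1}{(r+c)^3\gamma_*^2} \right) \quad \text{where } f := \frac{\lambda^+}{\lambda^-}.$$

Let $\xi_0(\zeta)$, $\rho$, $K(\zeta)$, $\alpha_{\text{add}}(\zeta)$, $\alpha_{\text{del}}(\zeta)$, $g_{j,k}$ be as defined in Definition 5.2. If the following conditions hold:

1) (algorithm parameters) $\xi = \xi_0(\zeta)$, $7\rho\xi \le \omega \le S_{\min} - 7\rho\xi$, $K = K(\zeta)$, $\alpha \ge \alpha_{\text{add}}(\zeta)$, $\tilde{\alpha} \ge \alpha_{\text{del}}(\zeta)$,

2) (denseness)

$$\max_j \kappa_{2s}(P_{j-1}) \le \kappa_{2s,*}^+ = 0.3, \quad \max_j \kappa_{2s}(P_{j,\text{new}}) \le \kappa_{2s,\text{new}}^+ = 0.15,$$

$$\max_j \max_{0 \le k \le K} \kappa_s(D_{j,\text{new},k}) \le \kappa_s^+ = 0.15, \quad \max_j \max_{0 \le k \le K} \kappa_{2s}(Q_{j,\text{new},k}) \le \tilde{\kappa}_{2s}^+ = 0.15,$$

$$\max_j \kappa_s\big((I - \hat{P}_{j-1}\hat{P}_{j-1}' - \hat{P}_{j,\text{new},K}\hat{P}_{j,\text{new},K}')P_j\big) \le \kappa_{s,e}^+$$

where $D_{j,\text{new},k} := (I - \hat{P}_{j-1}\hat{P}_{j-1}' - \hat{P}_{j,\text{new},k}\hat{P}_{j,\text{new},k}')P_{j,\text{new}}$ and $Q_{j,\text{new},k} := (I - P_{j,\text{new}}P_{j,\text{new}}')\hat{P}_{j,\text{new},k}$ and $\hat{P}_{j,\text{new},0} = [.]$,

3) (slow subspace change)

$$\max_j \max_{t \in \mathcal{I}_{j,k}} \|a_{t,\text{new}}\|_\infty \le \gamma_{\text{new},k} := \min(1.2^{k-1}\gamma_{\text{new}},\ \gamma_*), \text{ for all } k = 1, 2, \dots, K,$$

$$14\rho\xi_0(\zeta) \le S_{\min},$$

4) (small average condition number of Cov($a_{t,\text{new}}$)) $g_{j,k} \le g^+ := \sqrt{2}$,

5) (clustered eigenvalues) Assumption 2.5 holds with $\tilde{g}_{\max}$, $\tilde{h}_{\max}$, $\tilde{c}_{\min}$ satisfying $f_{\text{dec}}(\tilde{g}_{\max}, \tilde{h}_{\max}) - \frac{f_{\text{inc}}(\tilde{g}_{\max}, \tilde{h}_{\max})}{\tilde{c}_{\min}\zeta} > 0$, where $f_{\text{dec}}(\tilde{g}_{\max}, \tilde{h}_{\max})$ and $f_{\text{inc}}(\tilde{g}_{\max}, \tilde{h}_{\max})$ are defined in Definition 5.3 (also see Remark 7.3 which weakens this requirement),

then, with probability at least $1 - 2n^{-10}$, at all times, $t$,

1) $\hat{T}_t = T_t$ and $\|e_t\|_2 = \|L_t - \hat{L}_t\|_2 = \|\hat{S}_t - S_t\|_2 \le 0.18\sqrt{c}\,\gamma_{\text{new}} + 1.24\sqrt{\zeta}$.

2) the subspace error, $\text{SE}_{(t)}$, satisfies

$$\text{SE}_{(t)} \le \begin{cases} 0.6^{k-1} + r\zeta + 0.4c\zeta & \text{if } t \in \mathcal{I}_{j,k},\ k = 1, 2, \dots, K \\ (r+c)\zeta & \text{if } t \in \cup_{k=1}^{\vartheta_j} \tilde{\mathcal{I}}_{j,k} \\ r\zeta & \text{if } t \in \tilde{\mathcal{I}}_{j,\vartheta_j+1} \end{cases}$$

$$\le \begin{cases} 0.6^{k-1} + 10^{-2}\sqrt{\zeta} & \text{if } t \in \mathcal{I}_{j,k},\ k = 1, 2, \dots, K \\ 10^{-2}\sqrt{\zeta} & \text{if } t \in (\cup_{k=1}^{\vartheta_j} \tilde{\mathcal{I}}_{j,k}) \cup \tilde{\mathcal{I}}_{j,\vartheta_j+1} \end{cases}$$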
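3) the error $e_t = \hat{S}_t - S_t = L_t - \hat{L}_t$ satisfies the following at various times

$$\|e_t\|_2 \le \begin{cases} 1.17\,[0.15 \cdot 0.72^{k-1}\sqrt{c}\,\gamma_{\text{new}} + 0.15 \cdot 0.4c\zeta\sqrt{c}\,\gamma_* + r\zeta\sqrt{r}\,\gamma_*] & \text{if } t \in \mathcal{I}_{j,k},\ k = 1, 2, \dots, K \\ 1.17\,(r+c)\zeta\sqrt{r}\,\gamma_* & \text{if } t \in \cup_{k=1}^{\vartheta_j} \tilde{\mathcal{I}}_{j,k} \\ 1.17\,r\zeta\sqrt{r}\,\gamma_* & \text{if } t \in \tilde{\mathcal{I}}_{j,\vartheta_j+1} \end{cases}$$

$$\le \begin{cases} 0.18 \cdot 0.72^{k-1}\sqrt{c}\,\gamma_{\text{new}} + 1.17 \cdot 1.06\sqrt{\zeta} & \text{if } t \in \mathcal{I}_{j,k},\ k = 1, 2, \dots, K \\ 1.17\sqrt{\zeta} & \text{if } t \in (\cup_{k=1}^{\vartheta_j} \tilde{\mathcal{I}}_{j,k}) \cup \tilde{\mathcal{I}}_{j,\vartheta_j+1} \end{cases}$$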



The above result says the following. Assume that the initial subspace error is small enough. If the assumptions given in the theorem hold, then, w.h.p., we will get exact support recovery at all times. Moreover, the sparse recovery error (and the error in recovering $L_t$) will always be bounded by $0.18\sqrt{c}\,\gamma_{\text{new}}$ plus a constant times $\sqrt{\zeta}$. Since $\zeta$ is very small, $\gamma_{\text{new}} \ll S_{\min}$, and $c$ is also small, the normalized reconstruction error for $S_t$ will be small at all times, thus making this a meaningful result. In the second conclusion, we bound the subspace estimation error, $\text{SE}_{(t)}$. When a subspace change occurs, this error is initially bounded by one. The above result shows that, w.h.p., with each addition proj-PCA step, this error decays roughly exponentially and falls below $(r + c)\zeta$ within $K$ steps. After the cluster-PCA step, this error falls below $r\zeta$. By assumption, this occurs before the next subspace change time. Because of the choice of $\zeta$, both $(r + c)\zeta$ and $r\zeta$ are below $0.01\sqrt{\zeta}$. The third conclusion shows that the sparse recovery error as well as the error in recovering $L_t$ decay in a similar fashion.



B. Discussion 

Notice from Definition 5.2 that $K = K(\zeta)$ is larger if $\zeta$ is smaller. Also, both $\alpha_{\text{add}}(\zeta)$ and $\alpha_{\text{del}}(\zeta)$ are inversely proportional to $\zeta$. Thus, if we want to achieve a smaller lowest error level, $\zeta$, we need to compute both addition proj-PCA and cluster-PCA over larger durations, $\alpha$ and $\tilde{\alpha}$ respectively, and we will need a larger number of addition proj-PCA steps $K$. Because of slow subspace change, this means that we also require a larger delay between subspace change times, i.e. larger $t_{j+1} - t_j$.

1) Comparison with ReProCS: The ReProCS algorithm of [21] is Algorithm 2 with step 3c removed and replaced by $\hat{P}_j \leftarrow [\hat{P}_{j-1}, \hat{P}_{j,\text{new},K}]$. Let us compare the above result with that for ReProCS for the subspace change model of Assumption 2.1 [21, Corollary 43]. First, ReProCS requires $\kappa_{2s}([P_0, P_{1,\text{new}}, \dots, P_{J,\text{new}}]) \le 0.3$ whereas ReProCS-cPCA only requires $\max_j \kappa_{2s}(P_j) \le 0.3$. Moreover, ReProCS requires $\zeta$ to satisfy $\zeta \le \min\left( \frac{10^{-4}}{(r_0 + (J-1)c)^2},\ \frac{1.5 \times 10^{-4}}{(r_0 + (J-1)c)^2 f},\ \frac{1}{(r_0 + (J-1)c)^3\gamma_*^2} \right)$ whereas in the case of ReProCS-cPCA the denominators in the bound on $\zeta$ only contain $r + c = r_0 + 2c$ (instead of $r_0 + (J-1)c$).

Because of the above, in Theorem 4.1 for ReProCS-cPCA, the only place where $J$ (the number of subspace change times) appears is in the definitions of $\alpha_{\text{add}}$ and $\alpha_{\text{del}}$. Notice that $\alpha_{\text{add}}$ and $\alpha_{\text{del}}$ govern the delay between subspace change times, $t_{j+1} - t_j$. Thus, with ReProCS-cPCA, $J$ can keep increasing, as long as $t_{j+1} - t_j$ also increases accordingly. Moreover, notice that the dependence of $\alpha_{\text{add}}$ and $\alpha_{\text{del}}$ on $J$ is only logarithmic and thus $t_{j+1} - t_j$ needs to only increase in proportion to $\log J$. On the other hand, for ReProCS (see [21, Corollary 43]), $J$ appears in the denseness assumption, in the bound on $\zeta$ and in the definition of $\alpha_{\text{add}}$. Thus, ReProCS needs a bound on $J$ that is indirectly imposed by the denseness assumption.

The main extra assumptions that ReProCS-cPCA needs are (i) the clustering assumption (Assumption 2.5 with $\tilde{g}_{\max}$, $\tilde{h}_{\max}$ being small enough to satisfy $f_{\text{dec}}(\tilde{g}_{\max}, \tilde{h}_{\max}) - \frac{f_{\text{inc}}(\tilde{g}_{\max}, \tilde{h}_{\max})}{\tilde{c}_{\min}\zeta} > 0$); and (ii) $\max_j \kappa_s((I - \hat{P}_{j-1}\hat{P}_{j-1}' - \hat{P}_{j,\text{new},K}\hat{P}_{j,\text{new},K}')P_j) \le \kappa_{s,e}^+$. The second assumption is similar to the denseness assumption on $D_{j,\text{new},k}$ which is required by both ReProCS and ReProCS-cPCA. This is discussed in [21]. The clustering assumption is a practically valid one. We verified it for a video of moving lake waters shown in http://www.ece.iastate.edu/~chenlu/ReProCS/ReProCS.htm as follows. We first "low-rankified" it to 90% energy as explained in [21, Sec X-B]. Note that, with one sequence, it is not possible to estimate $\Lambda_t$ (this would require an ensemble of sequences) and thus it is not possible to check if all $\Lambda_t$'s in $[t_j, t_{j+1} - 1]$ are similar enough. However, by assuming that $\Lambda_t$ is the same for a long enough sequence, one can estimate it using a time average and then verify if its eigenvalues are sufficiently clustered. When this was done, we observed that the clustering assumption holds with $\tilde{g}_{\max} = 7.2$ and $\tilde{h}_{\max} = 0.34$.
2) Comparison with PCP: We provide a qualitative comparison with the PCP result of [2]. A direct comparison is not possible since the proof techniques used are very different and since we solve a recursive version of the problem whereas PCP solves a batch one. Moreover, PCP provides guarantees for exact recovery of $\mathcal{S}_t$ and $\mathcal{L}_t$. In our result, we obtain guarantees for exact support recovery of the $S_t$'s (and hence of $\mathcal{S}_t$) and bounded error recovery of its nonzero values and of $\mathcal{L}_t$. Also, the PCP algorithm assumes no model knowledge, whereas our algorithm does assume knowledge of model parameters. Of course, in Sec III-C, we have explained how to set the parameters in practice when the model is not known.

Consider the denseness assumptions. Let $\mathcal{L}_t = U\Sigma V'$ be its SVD. Then, for $t \in [t_j, t_{j+1} - 1]$, $U = [P_0, P_{1,\text{new}}, P_{2,\text{new}}, \dots, P_{j,\text{new}}]$ and $V = [a_1, a_2, \dots, a_t]'\Sigma^{-1}$. The result for PCP [2] assumes denseness of $U$ and of $V$: it requires $\kappa_1(U) \le \sqrt{\mu r/n}$ and $\kappa_1(V) \le \sqrt{\mu r/n}$ for a constant $\mu \ge 1$. Moreover, it also requires $\|UV'\|_{\max} \le \sqrt{\mu r}/n$. On the other hand, ReProCS-cPCA only requires $\kappa_{2s}(P_j) \le 0.3$ and $\kappa_{2s}(P_{j,\text{new}}) \le 0.15$. It does not need denseness of the entire $U$; it does not assume anything about denseness of $V$; and it does not need a bound on $\|UV'\|_{\max}$.

Another difference is that the result for PCP assumes that any element of the $n \times t$ matrix $\mathcal{S}_t$ is nonzero w.p. $\varrho$, and zero w.p. $1 - \varrho$, independent of all others (in particular, this means that the support sets of the different $S_t$'s are independent over time). Our result for ReProCS-cPCA does not put any such assumption. However it does require denseness of the matrix $D_{j,\text{new},k}$ whose columns span the unestimated part of span($P_{j,\text{new}}$) for $t \in \mathcal{I}_{j,k+1}$. As demonstrated in Sec. VIII, this reduces ($\kappa_s(D_{j,\text{new},k})$ increases) if the support sets of the $S_t$'s change very little over time. However, as long as, for most $k$, $\kappa_s(D_{j,\text{new},k})$ is anything smaller than one, which happens as long as there is at least one support change during $\mathcal{I}_{j,k}$, the subspace error does decay down to a small enough value within a finite number of steps. The number of steps required for this increases as $\kappa_s(D_{j,\text{new},k})$ increases. Since $\kappa_s(D_{j,\text{new},k})$ cannot be computed in polynomial time, for the above discussion, we computed $\|I_{T_t}'D_{j,\text{new},k}\|_2 / \|D_{j,\text{new},k}\|_2$ at $t = t_j + k\alpha - 1$ for $k = 0, 1, \dots, K$. In fact, our proof also only needs a bound on this latter quantity.
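The computable quantity mentioned above is just a ratio of two spectral norms, where $I_{T_t}'D$ keeps the rows of $D$ indexed by the support $T_t$. A minimal sketch follows; the function name `denseness_proxy` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def denseness_proxy(D, T):
    """Return ||I_T' D||_2 / ||D||_2, i.e. the norm of the rows of D indexed by T, normalized."""
    D = np.asarray(D, dtype=float)
    rows = D[np.asarray(T, dtype=int), :]     # I_T' D selects the rows in the support set T
    return np.linalg.norm(rows, 2) / np.linalg.norm(D, 2)

# A perfectly dense unit vector in R^4: each entry carries a quarter of the energy.
D = np.ones((4, 1)) / 2.0
ratio_one = denseness_proxy(D, [0])        # support of size 1
ratio_two = denseness_proxy(D, [0, 1])     # support of size 2
```

The ratio stays well below one for a dense $D$ and small supports, and approaches one when most of the energy of $D$ is concentrated on the rows in $T$.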

Also, some additional assumptions that ReProCS-cPCA needs are (a) accurate knowledge of the initial subspace and slow subspace change; (b) denseness of $Q_{j,\text{new},k}$; (c) the independence of the $a_t$'s over time; (d) the condition number of the average covariance matrix of $a_{t,\text{new}}$ is not too large; and (e) the clustering assumption. Assumptions (a), (b), (c) are discussed in detail in [21] and (a) is also verified for real data. As explained in [21], (c) can possibly be replaced by a weaker random walk model assumption on the $a_t$'s if we use the matrix Azuma inequality [26] instead of matrix Hoeffding. Assumption (e) is discussed above. Assumption (d) is also made for simplicity. It can be removed if a clustering assumption similar to Assumption 2.5 holds for $(\Lambda_t)_{\text{new}} = \text{Cov}(a_{t,\text{new}})$ during $t \in [t_j, \tilde{t}_j - 1]$ and we use an approach similar to cluster-PCA. If there are $\vartheta_{\text{new},j}$ clusters, we will need $\vartheta_{\text{new},j}$ proj-PCA steps to estimate $P_{\text{new},k}$ (instead of the current one step). At the $l$-th step, we use proj-PCA with $P$ being $\hat{P}_{j-1}$ concatenated with the basis matrix estimates for the last $l - 1$ clusters to recover the $l$-th cluster.

C. Special Case when f is small 

If in a problem, $L_t$ has small magnitude for all times $t$, then $f$, which is the maximum condition number of Cov($L_t$) for any $t$, can be small. If this is the case, then the clustering assumption trivially holds with $\vartheta_j = 1$, $\tilde{c}_{j,1} = r_j$, $\tilde{g}_{\max} = \tilde{g}_{j,1} = f$ and $\tilde{h}_{\max} = \tilde{h}_{j,1} = 0$. Thus, $\vartheta_{\max} = 1$. In this case, the following corollary holds.

Corollary 4.2: Assume that the initial subspace estimate is accurate enough as given in Theorem 4.1 with $\zeta$ as chosen there. Also assume that the first four conditions of Theorem 4.1 hold. Then, if $f$ is small enough so that $f_{\text{inc}}(f, 0) < f_{\text{dec}}(f, 0)\,\tilde{c}_{\min}\zeta$, then, all conclusions of Theorem 4.1 hold.

Notice that the above corollary does not need Assumption 2.5 to hold.

V. Definitions, Proof Outline and Connection between addition and deletion steps 

In Sec V-A, we define all the quantities that are needed for the proof. The proof outline is given in Sec V-B. We discuss how the proof strategy for the cluster-PCA (for deletion) step is related to that of addition proj-PCA in Sec V-C.
A. Definitions

Certain quantities are defined earlier in Assumptions 2.1 and 2.5, in Definitions 2.2 and 3.1, in Algorithm 2 and in Theorem 4.1.

Definition 5.1: In the sequel, we let

1) $c := c_{\max}$ and $r := r_{\max} = r_0 + c$ and so $r_j = r_0 + \sum_{i=1}^{j}(c_{i,\text{new}} - c_{i,\text{old}}) \le r$,

2) $\phi^+ := 1.1735$

Definition 5.2: We define here the parameters used in Theorem 4.1.

1) Define $K(\zeta) := \left\lceil \frac{\log(0.6c\zeta)}{\log 0.6} \right\rceil$

2) Define $\xi_0(\zeta) := \sqrt{c}\,\gamma_{\text{new}} + 1.06\sqrt{\zeta}$

3) Define $\rho := \max_t\{\kappa_1(\hat{S}_{t,\text{cs}} - S_t)\}$. Notice that $\rho \le 1$.

4) Define the condition number of the average of Cov($a_{t,\text{new}}$) over $t \in \mathcal{I}_{j,k}$ as

$$g_{j,k} := \frac{\lambda_{j,\text{new},k}^+}{\lambda_{j,\text{new},k}^-} \quad \text{where} \quad \lambda_{j,\text{new},k}^+ := \lambda_{\max}\Big(\frac{1}{\alpha}\sum_{t \in \mathcal{I}_{j,k}} (\Lambda_t)_{\text{new}}\Big), \quad \lambda_{j,\text{new},k}^- := \lambda_{\min}\Big(\frac{1}{\alpha}\sum_{t \in \mathcal{I}_{j,k}} (\Lambda_t)_{\text{new}}\Big).$$

5) Let $K = K(\zeta)$. We define $\alpha_{\text{add}}(\zeta)$ as in [21]: it is the smallest value of $\alpha$ so that $(p_K(\alpha, \zeta))^{KJ} \ge 1 - n^{-10}$, where $p_K(\alpha, \zeta)$ is defined in [21, Lemma 35]. An explicit value for it [21] is

8 24^ 16 
«add(C) = [(log6^J + 11 logn)— -2 max(min(1.2 4K 7n 4 ew , 7 4 ), _ , 4(0.186 7n 2 ew + 0.0034 7new + 2.3) 2 )1 

In words, $\alpha_{\text{add}}$ is the smallest value of the number of data points, $\alpha$, needed for an addition proj-PCA step to ensure that Theorem 4.1 holds w.p. at least $(1 - 2n^{-10})$.

6) We define $\alpha_{\text{del}}(\zeta)$ as the smallest value of $\tilde{\alpha}$ so that $\tilde{p}(\tilde{\alpha}, \zeta)^{\vartheta_{\max}J} \ge 1 - n^{-10}$ where $\tilde{p}(\tilde{\alpha}, \zeta)$ is defined in Lemma 7.8. We can compute an explicit value for it by using the fact that for any $x \le 1$ and $r \ge 1$, $(1 - x)^r \ge 1 - rx$ and that



Ei=i e ^ <6e °—i=i,2... 6 ^. We get 
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2 



add(C) := r(log6i? max J + lllogn)^— ^ max(4.2 2 ,4&2)] 

(s A ) 

where $b_7 := (\sqrt{r}\,\gamma_* + \phi^+\sqrt{\zeta})^2$ and $\phi^+ = 1.1735$. In words, $\alpha_{\text{del}}$ is the smallest value of the number of data points, $\tilde{\alpha}$, needed for a deletion proj-PCA step to ensure that Theorem 4.1 holds w.p. at least $(1 - 2n^{-10})$.
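To see how the number of addition steps scales with the target error level, the following sketch assumes the form $K(\zeta) = \lceil \log(0.6c\zeta)/\log 0.6 \rceil$, which is the smallest $K$ for which the addition-step bound $0.6^K + 0.4c\zeta$ of Table I falls below $c\zeta$. The function names and the numerical values of $c$ and $\zeta$ are hypothetical.

```python
import math

def num_addition_steps(c, zeta):
    """Smallest integer K with 0.6**K <= 0.6 * c * zeta (assumed form of K(zeta))."""
    return math.ceil(math.log(0.6 * c * zeta) / math.log(0.6))

def addition_bound(k, c, zeta):
    """Bound 0.6**k + 0.4*c*zeta on the error after the k-th addition proj-PCA step."""
    return 0.6 ** k + 0.4 * c * zeta

c, zeta = 5, 1e-4
K = num_addition_steps(c, zeta)
reached = addition_bound(K, c, zeta) <= c * zeta           # target met after K steps
not_yet = addition_bound(K - 1, c, zeta) > c * zeta        # but not after K - 1 steps
```

Since $K$ grows only logarithmically in $1/\zeta$, reducing the target error by a factor of ten adds only a handful of proj-PCA steps.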
Definition 5.3: Define the following.

1) $\zeta_*^+ := r\zeta$

2) define the series $\{\zeta_k^+\}_{k=0,1,2,\dots,K}$ as follows

$$\zeta_0^+ := 1, \quad \zeta_k^+ := \frac{b + 0.125c\zeta}{1 - (\zeta_*^+)^2 - (\zeta_*^+)^2 f - 0.25c\zeta - b} \quad \text{for } k \ge 1, \qquad (6)$$


where b := + <>+) V(C fc + _ x ) 2 + C'/(C + ) 2 , «+ := 0.15, C := (^Ty? + <t> + ), C ' : = ((^ + ) 2 + 

2 ^ + ^l + ^>++ ^ + f ( *T )» g:^((0 + ) 2 + f (0+)2 )■ 



\/!-(C. + ) 2 _ Vi-(C + ) 2 Vi-(C + ) 2 ' VMC?) 

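3) define the series $\{\tilde{\zeta}_k^+\}_{k=1,2,\dots,\vartheta_j}$ as follows

$$\tilde{\zeta}_k^+ := \frac{f_{\text{inc}}(\tilde{g}_k, \tilde{h}_k)}{f_{\text{dec}}(\tilde{g}_k, \tilde{h}_k)}$$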

where / mc (g, ft) := (r + c)C[3 K + e 0+.g + [«+ e 0+ + «+ e (l + 20+)-^gL=]ft + [£c + 4rC< e </>+ + 2(r + c)C(l + 

^ e 2 )0 +2 ]/ + - 2 ^]. and fdec{g,h) := l-h-0.2C,-r 2 C, 2 f -r 2 C, 2 - f mc {g, ft). Notice that f inc (g,h) is an increasing 
function of g, ft and fdec(g, h) is a decreasing function of g, ft. 
As we will see, $\zeta_*^+$, $\zeta_k^+$, $\tilde{\zeta}_k^+$ are the high probability upper bounds on $\zeta_{j,*}$, $\zeta_{j,k}$, $\tilde{\zeta}_{j,k}$ (defined in Definition 5.8) under the assumptions of Theorem 4.1.

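Definition 5.4: For the addition step, define

1) $\Phi_{j,k} := I - \hat{P}_{j-1}\hat{P}_{j-1}' - \hat{P}_{j,\text{new},k}\hat{P}_{j,\text{new},k}'$ and $\Phi_{j,0} := I - \hat{P}_{j-1}\hat{P}_{j-1}'$.

2) $\phi_k := \max_j \max_{T : |T| \le s} \|((\Phi_{j,k})_T'(\Phi_{j,k})_T)^{-1}\|_2$. It is easy to see that $\phi_k \le \frac{1}{1 - \max_j \delta_s(\Phi_{j,k})}$.

3) $D_{j,\text{new},k} := \Phi_{j,k}P_{j,\text{new}}$ and $D_{j,\text{new}} := D_{j,\text{new},0} = \Phi_{j,0}P_{j,\text{new}}$.

For the cluster-PCA step (for deletion), define

1) $\Psi_{j,k} := I - \sum_{i=1}^{k} \hat{G}_{j,i}\hat{G}_{j,i}'$

2) $G_{j,\text{det},k} := [G_{j,1}, \dots, G_{j,k-1}]$ and $\hat{G}_{j,\text{det},k} := [\hat{G}_{j,1}, \dots, \hat{G}_{j,k-1}]$. Notice that $\Psi_{j,k} = I - \hat{G}_{j,\text{det},k+1}\hat{G}_{j,\text{det},k+1}'$.

3) $G_{j,\text{undet},k} := [G_{j,k+1}, \dots, G_{j,\vartheta_j}]$.

4) $D_{j,k} := \Psi_{j,k-1}G_{j,k}$, $D_{j,\text{det},k} := \Psi_{j,k-1}G_{j,\text{det},k}$ and $D_{j,\text{undet},k} := \Psi_{j,k-1}G_{j,\text{undet},k}$.

Here, $G_{j,\text{det},k}$ contains the directions that are already detected before the $k$-th step of cluster-PCA; $G_{j,k}$ contains the directions that are being detected in the current step; $G_{j,\text{undet},k}$ contains the as yet undetected directions.

Definition 5.5: Let $\kappa_{s,*} := \max_j \kappa_s(P_{j-1})$, $\kappa_{s,\text{new}} := \max_j \kappa_s(P_{j,\text{new}})$, $\kappa_{s,k} := \max_j \kappa_s(D_{j,\text{new},k})$, $\tilde{\kappa}_{s,k} := \max_j \kappa_s((I - P_{j,\text{new}}P_{j,\text{new}}')\hat{P}_{j,\text{new},k})$, $\kappa_{s,e} := \max_j \kappa_s(\Phi_K P_j)$.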

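Definition 5.6:

1) Let $D_{j,k} \overset{QR}{=} E_{j,k}R_{j,k}$ denote its QR decomposition. Here, $E_{j,k}$ is a basis matrix while $R_{j,k}$ is upper triangular (see footnote 3).

2) Let $E_{j,k,\perp}$ be a basis matrix for the orthogonal complement of span($E_{j,k}$) = span($D_{j,k}$). To be precise, $E_{j,k,\perp}$ is an $n \times (n - \tilde{c}_{j,k})$ basis matrix that satisfies $E_{j,k,\perp}'E_{j,k} = 0$.

3) Using $E_{j,k}$ and $E_{j,k,\perp}$, define $A_{j,k}$, $A_{j,k,\perp}$, $H_{j,k}$, $H_{j,k,\perp}$ and $B_{j,k}$ as

$$A_{j,k} := \frac{1}{\tilde{\alpha}} \sum_{t \in \tilde{\mathcal{I}}_{j,k}} E_{j,k}'\Psi_{j,k-1}L_tL_t'\Psi_{j,k-1}E_{j,k}$$

$$A_{j,k,\perp} := \frac{1}{\tilde{\alpha}} \sum_{t \in \tilde{\mathcal{I}}_{j,k}} E_{j,k,\perp}'\Psi_{j,k-1}L_tL_t'\Psi_{j,k-1}E_{j,k,\perp}$$

$$H_{j,k} := \frac{1}{\tilde{\alpha}} \sum_{t \in \tilde{\mathcal{I}}_{j,k}} E_{j,k}'\Psi_{j,k-1}(e_te_t' - L_te_t' - e_tL_t')\Psi_{j,k-1}E_{j,k}$$

$$H_{j,k,\perp} := \frac{1}{\tilde{\alpha}} \sum_{t \in \tilde{\mathcal{I}}_{j,k}} E_{j,k,\perp}'\Psi_{j,k-1}(e_te_t' - L_te_t' - e_tL_t')\Psi_{j,k-1}E_{j,k,\perp}$$

$$B_{j,k} := \frac{1}{\tilde{\alpha}} \sum_{t \in \tilde{\mathcal{I}}_{j,k}} E_{j,k,\perp}'\Psi_{j,k-1}\hat{L}_t\hat{L}_t'\Psi_{j,k-1}E_{j,k} = \frac{1}{\tilde{\alpha}} \sum_{t \in \tilde{\mathcal{I}}_{j,k}} E_{j,k,\perp}'\Psi_{j,k-1}(L_t - e_t)(L_t - e_t)'\Psi_{j,k-1}E_{j,k}$$

4) Define

$$\mathcal{A}_{j,k} := \begin{bmatrix} E_{j,k} & E_{j,k,\perp} \end{bmatrix} \begin{bmatrix} A_{j,k} & 0 \\ 0 & A_{j,k,\perp} \end{bmatrix} \begin{bmatrix} E_{j,k}' \\ E_{j,k,\perp}' \end{bmatrix}, \quad \mathcal{H}_{j,k} := \begin{bmatrix} E_{j,k} & E_{j,k,\perp} \end{bmatrix} \begin{bmatrix} H_{j,k} & B_{j,k}' \\ B_{j,k} & H_{j,k,\perp} \end{bmatrix} \begin{bmatrix} E_{j,k}' \\ E_{j,k,\perp}' \end{bmatrix} \qquad (7)$$

5) From the above, it is easy to see that

$$\mathcal{A}_{j,k} + \mathcal{H}_{j,k} = \frac{1}{\tilde{\alpha}} \sum_{t \in \tilde{\mathcal{I}}_{j,k}} \Psi_{j,k-1}\hat{L}_t\hat{L}_t'\Psi_{j,k-1}.$$

6) Recall from Algorithm 2 that

$$\mathcal{A}_{j,k} + \mathcal{H}_{j,k} = \frac{1}{\tilde{\alpha}} \sum_{t \in \tilde{\mathcal{I}}_{j,k}} \Psi_{j,k-1}\hat{L}_t\hat{L}_t'\Psi_{j,k-1} \overset{EVD}{=} \begin{bmatrix} \hat{G}_{j,k} & \hat{G}_{j,k,\perp} \end{bmatrix} \begin{bmatrix} \Lambda_k & 0 \\ 0 & \Lambda_{k,\perp} \end{bmatrix} \begin{bmatrix} \hat{G}_{j,k}' \\ \hat{G}_{j,k,\perp}' \end{bmatrix}$$

is the EVD of $\mathcal{A}_{j,k} + \mathcal{H}_{j,k}$. Here $\Lambda_k$ is a $\tilde{c}_{j,k} \times \tilde{c}_{j,k}$ diagonal matrix.

Footnote 3: Notice that $0 < \sqrt{1 - r^2\zeta^2} \le \sigma_i(R_{j,k})$ by Lemma 7.3; therefore, $R_{j,k}$ is invertible.

Definition 5.7: Let $P_{j,*} := P_{j-1} = P_{(t_j-1)}$. Recall that $\hat{P}_{j,*} := \hat{P}_{(t_j-1)} = \hat{P}_{j-1}$. In the sequel, we use the subscript $*$ to denote the quantity at $t = t_j - 1$.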

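Definition 5.8 (Subspace estimation errors):

1) Recall that the subspace error at time $t$ is $\text{SE}_{(t)} := \|(I - \hat{P}_{(t)}\hat{P}_{(t)}')P_{(t)}\|_2$.

2) Define

$$\zeta_{j,*} := \|(I - \hat{P}_{j,*}\hat{P}_{j,*}')P_{j,*}\|_2.$$

This is the subspace error at $t = t_j - 1$, i.e. $\zeta_{j,*} = \text{SE}_{(t_j-1)}$.

3) For $k = 0, 1, 2, \dots, K$, define

$$\zeta_{j,k} := \|(I - \hat{P}_{j-1}\hat{P}_{j-1}' - \hat{P}_{j,\text{new},k}\hat{P}_{j,\text{new},k}')P_{j,\text{new}}\|_2.$$

This is the error in estimating span($P_{j,\text{new}}$) after the $k$-th iteration of the addition step.

4) For $k = 1, 2, \dots, \vartheta_j$, define

$$\tilde{\zeta}_{j,k} := \Big\|\Big(I - \sum_{i=1}^{k} \hat{G}_{j,i}\hat{G}_{j,i}'\Big)G_{j,k}\Big\|_2.$$

This is the error in estimating span($G_{j,k}$) after the $k$-th iteration of the cluster-PCA step.

Remark 5.9 (Notational issue): Notice that $\zeta$ is a given scalar satisfying the bound given in Theorem 4.1 while $\zeta_{j,k}$, $\zeta_{j,*}$ and $\tilde{\zeta}_{j,k}$ are as defined above. The basis matrix estimates are functions of the $\hat{L}_t$'s, which in turn depend on the $L_t$'s, and $L_t = P_{(t)}a_t$; thus, $\zeta_{j,k}$, $\zeta_{j,*}$ and $\tilde{\zeta}_{j,k}$ are functions of the $a_t$'s. Thus, $\zeta_{j,k}$, $\zeta_{j,*}$ and $\tilde{\zeta}_{j,k}$ are, in fact, random variables.

Remark 5.10:

1) Notice that $\zeta_{j,0} = \|D_{j,\text{new}}\|_2$, $\zeta_{j,k} = \|D_{j,\text{new},k}\|_2$ and $\tilde{\zeta}_{j,k} = \|(I - \hat{G}_{j,k}\hat{G}_{j,k}')D_{j,k}\|_2 = \|\Psi_{j,k}G_{j,k}\|_2$.

2) Notice from the algorithm that (i) $\hat{P}_{j,\text{new},k}$ is perpendicular to $\hat{P}_{j,*} = \hat{P}_{j-1}$; and (ii) $\hat{G}_{j,k}$ is perpendicular to $[\hat{G}_{j,1}, \hat{G}_{j,2}, \dots, \hat{G}_{j,k-1}]$.

3) For $t \in \mathcal{I}_{j,k}$, $P_{(t)} = P_j = [(P_{j-1} \setminus P_{j,\text{old}}),\ P_{j,\text{new}}]$, $\hat{P}_{(t)} = [\hat{P}_{j-1}\ \hat{P}_{j,\text{new},k}]$ and

$$\text{SE}_{(t)} = \|(I - \hat{P}_{j-1}\hat{P}_{j-1}' - \hat{P}_{j,\text{new},k}\hat{P}_{j,\text{new},k}')P_j\|_2 \le \|(I - \hat{P}_{j-1}\hat{P}_{j-1}' - \hat{P}_{j,\text{new},k}\hat{P}_{j,\text{new},k}')[P_{j-1}\ P_{j,\text{new}}]\|_2 \le \zeta_{j,*} + \zeta_{j,k}$$

for $k = 1, 2, \dots, K$. The last inequality uses the first item of this remark.

4) For $t \in \tilde{\mathcal{I}}_{j,k}$, $P_{(t)} = P_j$, $\hat{P}_{(t)} = [\hat{P}_{j-1}\ \hat{P}_{j,\text{new},K}]$ and

$$\text{SE}_{(t)} = \text{SE}_{(t_j + K\alpha - 1)} \le \zeta_{j,*} + \zeta_{j,K}$$

5) For $t \in \tilde{\mathcal{I}}_{j,\vartheta_j+1}$, $P_{(t)} = P_j = [G_{j,1}, \dots, G_{j,\vartheta_j}]$, $\hat{P}_{(t)} = \hat{P}_j = [\hat{G}_{j,1}, \dots, \hat{G}_{j,\vartheta_j}]$, and

$$\text{SE}_{(t)} = \|(I - \hat{P}_j\hat{P}_j')P_j\|_2 \le \sum_{k=1}^{\vartheta_j} \tilde{\zeta}_{j,k}.$$

The last inequality uses the first item of this remark.

Remark 5.11: Recall that $e_t := \hat{S}_t - S_t$. Notice from Algorithm 2 that

1) $e_t = L_t - \hat{L}_t$.

2) If $\hat{T}_t = T_t$, then $e_t = I_{T_t}[(\Phi_{(t)})_{T_t}'(\Phi_{(t)})_{T_t}]^{-1}I_{T_t}'\Phi_{(t)}P_{(t)}a_t$. This follows using the definition of $\hat{S}_t$ given in step 1d of the algorithm and the fact that $(\Phi_{(t)})_T'\Phi_{(t)} = (\Phi_{(t)}I_T)'\Phi_{(t)} = I_T'\Phi_{(t)}$ for any set $T$. Thus, for $t \in [t_j, t_{j+1} - 1]$,

$$e_t = I_{T_t}[(\Phi_{(t)})_{T_t}'(\Phi_{(t)})_{T_t}]^{-1}I_{T_t}'\Phi_{(t)}P_ja_t = I_{T_t}[(\Phi_{(t)})_{T_t}'(\Phi_{(t)})_{T_t}]^{-1}I_{T_t}'\Phi_{(t)}[(P_{j-1} \setminus P_{j,\text{old}})a_{t,*} + P_{j,\text{new}}a_{t,\text{new}}] \qquad (8)$$

with

$$\Phi_{(t)} = \begin{cases} \Phi_{j,k-1} & t \in \mathcal{I}_{j,k},\ k = 1, 2, \dots, K \\ \Phi_{j,K} & t \in \tilde{\mathcal{I}}_{j,k},\ k = 1, 2, \dots, \vartheta_j \\ \Phi_{j+1,0} & t \in \tilde{\mathcal{I}}_{j,\vartheta_j+1} \end{cases}$$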
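TABLE I

Comparing and contrasting the addition proj-PCA step and proj-PCA used in the deletion step (cluster-PCA)

| $k$-th iteration of addition proj-PCA | $k$-th iteration of cluster-PCA in the deletion step |
|---|---|
| done at $t = t_j + k\alpha - 1$ | done at $t = t_j + K\alpha + \vartheta_j\tilde{\alpha} - 1$ |
| goal: keep improving estimates of span($P_{j,\text{new}}$) | goal: re-estimate span($P_j$) and thus "delete" span($P_{j,\text{old}}$) |
| compute $\hat{P}_{j,\text{new},k}$ by proj-PCA on $[\hat{L}_t : t \in \mathcal{I}_{j,k}]$ with $P = \hat{P}_{j-1}$ | compute $\hat{G}_{j,k}$ by proj-PCA on $[\hat{L}_t : t \in \tilde{\mathcal{I}}_{j,k}]$ with $P = \hat{G}_{j,\text{det},k} = [\hat{G}_{j,1}, \dots, \hat{G}_{j,k-1}]$ |
| start with $\lVert (I - \hat{P}_{j-1}\hat{P}_{j-1}')P_{j-1} \rVert_2 \le r\zeta$ and $\zeta_{j,k-1} \le \zeta_{k-1}^+ \le 0.6^{k-1} + 0.4c\zeta$ | start with $\lVert (I - \hat{G}_{j,\text{det},k}\hat{G}_{j,\text{det},k}')G_{j,\text{det},k} \rVert_2 \le r\zeta$ and $\zeta_{j,K} \le c\zeta$ |
| need small $g_{j,k}$, which is the condition number of the average of Cov($P_{j,\text{new}}'L_t$) over $t \in \mathcal{I}_{j,k}$ | need small $\tilde{g}_{j,k}$, which is the maximum of the condition number of Cov($G_{j,k}'L_t$) over $t \in \tilde{\mathcal{I}}_{j,k}$ |
| no undetected subspace | extra issue: ensure perturbation due to span($G_{j,\text{undet},k}$) is small; need small $\tilde{h}_{j,k}$ to ensure the above |
| $\zeta_{j,k}$ is the subspace error in estimating span($P_{j,\text{new}}$) after the $k$-th step | $\tilde{\zeta}_{j,k}$ is the subspace error in estimating span($G_{j,k}$) after the $k$-th step |
| end with $\zeta_{j,k} \le \zeta_k^+ \le 0.6^k + 0.4c\zeta$ w.h.p. | end with $\tilde{\zeta}_{j,k} \le \tilde{\zeta}_k^+ \le \tilde{c}_{j,k}\zeta$ w.h.p. |
| stop when $k = K$ with $K$ chosen so that $\zeta_{j,K} \le c\zeta$ | stop when $k = \vartheta_j$ and $\tilde{\zeta}_{j,k} \le \tilde{c}_{j,k}\zeta$ for all $k = 1, 2, \dots, \vartheta_j$ |
| after $K$-th iteration: $\hat{P}_{(t)} \leftarrow [\hat{P}_{j-1}\ \hat{P}_{j,\text{new},K}]$ and $\text{SE}_{(t)} \le (r + c)\zeta$ | after $\vartheta_j$-th iteration: $\hat{P}_{(t)} \leftarrow [\hat{G}_{j,1}, \dots, \hat{G}_{j,\vartheta_j}]$ and $\text{SE}_{(t)} \le r\zeta$ |
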



Definition 5.12: Define the random variable 

Xj,ki,k 2 '•={0,1,0,2,-" i at 3 +fciQ+fe 2 5-i}- 

Recall that at's are mutually independent over i. 
Definition 5.13: Define the set Tj,k lt k2 as follows. 

f j>kfi := {X j<k<0 : Q,k < Ck > and = T * and e * satisfies © for all f g X jik }, k = l,2,...K, j = 1, 2, 3, ... J 
f j,K,fc := {-Xj.K.fc : Cj.fc < Cj.fcCi and = r t and e t satisfies (0) for all £ g Zj,fc}, fc = 1, 2, . . j = 1, 2,3, ... J 
1 ;. r . • i := {X,-+i i0 ,o -T t = T t and e t satisfies © for all £ g Z^.+i}, j = 1, 2, 3, . . . J 

Define the set Tj^fa as follows. 

r lj0 , := {-X"i,o,o : Ci,* < r Ci and it = T t and e t satisfies © for all i g [itrain,*i - 1]}, 
Fj,k,o : = rj.k-1,0 n f j ifci0 , fc = 1, 2, . . . K, j = 1,2,3, ... J 

Ij-.jir.Ai : = r^K.fe-i n r^K.fc, fc = 1,2, . . .t?j, j = 1,2,3, T 

Tj+1,0,0 := ^j,K,-dj nr^if^+i, j = 1,2,3, ... J 

Recall from the notation section that the event Tj ki k2 := {Xj :kl .k 2 G Fj jkltk2 }. 

Remark 5.14: Notice that the subscript $j$ always appears as the first subscript, while $k$ is the last one. At many places in this paper, we remove the subscript $j$ for simplicity. Whenever there is only one subscript, it refers to the value of $k$, e.g., $\Phi_0$ refers to $\Phi_{j,0}$, $\hat{P}_{\mathrm{new},k}$ refers to $\hat{P}_{j,\mathrm{new},k}$ and so on.

B. Proof Outline of Theorem 4.1

The first part of the proof, which analyzes the projected CS step and the addition step, is essentially the same as that in [21]. The only difference is that, now, $\zeta_*^+ = r\zeta$ instead of $\zeta_*^+ = (r_0 + (j-1)c)\zeta$. In Lemma 6.1, the final conclusions for this part are summarized: it shows that, for all $k = 1, 2, \dots, K$, $\zeta_k^+$ decays roughly exponentially with $k$, and it bounds the probability of $\Gamma^e_{j,k,0}$ given $\Gamma^e_{j,k-1,0}$. The second part of the proof analyzes the projected CS step and the cluster-PCA step. The final conclusion for this part is summarized in Lemma 6.2: it bounds the probability of $\Gamma^e_{j,K,k}$ given $\Gamma^e_{j,K,k-1}$. Theorem 4.1 follows essentially by applying Lemmas 6.2 and 6.1 for each $j$ and $k$ and using Lemma 1.5.

Lemma 6.2, in turn, follows by combining the results of Lemma 7.2 (which shows exact support recovery and bounds the sparse recovery error for $t \in \tilde{\mathcal{I}}_{j,k}$ conditioned on $\Gamma^e_{j,K,k-1}$), and Lemma 7.8 (which bounds the subspace recovery error at the $k$-th step of cluster-PCA conditioned on $\Gamma^e_{j,K,k-1}$). Lemma 7.2 uses the result of Lemma 7.1, which bounds the RIC of $\Phi_K$ in terms of $\zeta_*$, $\zeta_K$ and the denseness coefficients of $P_*$ and $P_{\mathrm{new}}$. Lemma 7.8 is obtained as follows. In Lemma 7.4 we show that, under the theorem's assumptions, $\tilde{\zeta}_k^+ \le \tilde{c}_{j,k}\zeta$. In Lemma 7.6 we bound $\tilde{\zeta}_k$ in terms of $\lambda_{\min}(A_k)$, $\lambda_{\max}(A_{k,\perp})$ and $\|\mathcal{H}_k\|_2$ using Lemma 1.11. Next, in Lemma 7.7, (i) we use Lemma 7.2 and the Hoeffding corollaries (Corollaries 1.6 and 1.7) to bound each of these terms and (ii) we then use Lemma 7.6 and these bounds to bound $\tilde{\zeta}_k$ by $\tilde{\zeta}_k^+$ with a certain probability conditioned on $\Gamma^e_{j,K,k-1}$. Finally, Lemma 7.8 follows by combining Lemma 7.4 and Lemma 7.7.

C. Connection with Addition proj-PCA 

Our strategy for analyzing cluster-PCA, and hence for proving Theorem 4.1, is a generalization of that used to analyze the $k$-th addition proj-PCA step in [21]. We discuss this in Table I.

VI. Proof of Theorem 4.1
The theorem is a direct consequence of Lemmas 6.1 and 6.2 given below.

A. Two Main Lemmas

The lemma below is a slight modification of [21, Lemma 40]. It summarizes the final conclusions of the addition step.
Lemma 6.1 (Final lemma for addition step): Assume that all the conditions in Theorem 4.1 hold. Also assume that $P(\Gamma^e_{j,k-1,0}) > 0$. Then

1) $\zeta_0^+ = 1$, $\zeta_k^+ \le 0.6^k + 0.4c\zeta$ for all $k = 1, 2, \dots, K$;

2) $P(\Gamma^e_{j,k,0} \mid \Gamma^e_{j,k-1,0}) \ge p_k(\alpha, \zeta) \ge p_K(\alpha, \zeta)$ for all $k = 1, 2, \dots, K$.

where $\zeta_k^+$ is defined in Definition 5.3 and $p_k(\alpha, \zeta)$ is defined in [21, Lemma 35].

The proof of the above lemma follows using the exact same approach as in the proof of Lemma 40 of [21] but with $\zeta_*^+ = r\zeta$ instead of $(r_0 + (j-1)c_{\max})\zeta$ everywhere. We give the proof outline in Appendix A.

The lemma below summarizes the final conclusions for the cluster-PCA step. It is proved using lemmas given in Sec VII.
Lemma 6.2 (Final lemma for cluster-PCA): Assume that all the conditions in Theorem 4.1 hold. Also assume that $P(\Gamma^e_{j,K,k-1}) > 0$. Then,

1) for all $k = 1, 2, \dots, \vartheta_j$, $P(\Gamma^e_{j,K,k} \mid \Gamma^e_{j,K,k-1}) \ge \tilde{p}(\tilde{\alpha}, \zeta)$ where $\tilde{p}(\tilde{\alpha}, \zeta)$ is defined in Lemma 7.7;

2) $P(\Gamma^e_{j+1,0,0} \mid \Gamma^e_{j,K,\vartheta_j}) = 1$.

Proof: Notice that $P(\Gamma^e_{j,K,k} \mid \Gamma^e_{j,K,k-1}) = P(\tilde{\zeta}_k \le \tilde{c}_k\zeta \text{ and } \hat{T}_t = T_t \text{ and } e_t \text{ satisfies (8) for all } t \in \tilde{\mathcal{I}}_{j,k} \mid \Gamma^e_{j,K,k-1})$ and $P(\Gamma^e_{j+1,0,0} \mid \Gamma^e_{j,K,\vartheta_j}) = P(\hat{T}_t = T_t \text{ and } e_t \text{ satisfies (8) for all } t \in \tilde{\mathcal{I}}_{j,\vartheta_j+1} \mid \Gamma^e_{j,K,\vartheta_j})$. The first claim of the lemma follows by combining Lemma 7.8 and the last claim of Lemma 7.2, both given below in Sec VII. The second claim follows using the last claim of Lemma 7.2. ■
Remark 6.3: Under the assumptions of Theorem 4.1, it is easy to see that the following holds.

1) For any $k = 1, 2, \dots, K$, $\Gamma^e_{j,k,0}$ implies that (i) $\zeta_{j,*} \le \zeta_*^+ := r\zeta$ and (ii) $\zeta_{j,k'} \le 0.6^{k'} + 0.4c\zeta$ for all $k' = 1, 2, \dots, k$.

• (i) follows from the definition of $\Gamma^e_{j,k,0}$ and $\zeta_{j,*} \le \sum_{k'=1}^{\vartheta_{j-1}} \tilde{\zeta}_{j-1,k'} \le \sum_{k'=1}^{\vartheta_{j-1}} \tilde{c}_{j-1,k'}\zeta = r_{j-1}\zeta \le r\zeta = \zeta_*^+$, and (ii) follows from the definition of $\Gamma^e_{j,k,0}$ and the first claim of Lemma 6.1.

2) For any k = 1, 2 . . . + 1, P, /x implies (i) Cj> < C + , (") Cj.fe' < 0.6 fe ' + 0.4cC for all A;' = 1, 2, . . .K, (iii) Cj.x < <, 
(iv) H^jfP^lla < (r + c)C, (v) Cj.fe' < Cj.fc'C for k' = 1,2, . . .k and (vi) =1 6,* ^ r X < 

• (i) and (ii) follow because K C fc , (iii) follows from (ii) using the definition of K, (iv) follows from (i) 
and (iii) using H^/r-Pjlh < ||$i,i<r[-Pj,*> Pj,new]||2 < Cj,* + Cj,K, and (v) follows from the definition of K k . 

3) $\Gamma^e_{J+1,0,0}$ implies (i) $\zeta_{j,*} \le \zeta_*^+$ for all $j$, (ii) $\zeta_{j,k} \le 0.6^k + 0.4c\zeta$ for all $k = 1, \dots, K$ and all $j$, (iii) $\zeta_{j,K} \le c\zeta$ for all $j$.
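B. Proof of Theorem 4.1

The theorem is a direct consequence of Lemmas 6.1 and 6.2 and Lemma 1.5.

Notice that $\Gamma^e_{j,0,0} \supseteq \Gamma^e_{j,1,0} \supseteq \cdots \supseteq \Gamma^e_{j,K,0} \supseteq \Gamma^e_{j,K,1} \supseteq \Gamma^e_{j,K,2} \supseteq \cdots \supseteq \Gamma^e_{j,K,\vartheta_j} \supseteq \Gamma^e_{j+1,0,0}$. Thus, by Lemma 1.5,

$P(\Gamma^e_{j+1,0,0} \mid \Gamma^e_{j,0,0}) = P(\Gamma^e_{j+1,0,0} \mid \Gamma^e_{j,K,\vartheta_j}) \prod_{k=1}^{\vartheta_j} P(\Gamma^e_{j,K,k} \mid \Gamma^e_{j,K,k-1}) \prod_{k=1}^{K} P(\Gamma^e_{j,k,0} \mid \Gamma^e_{j,k-1,0})$ and $P(\Gamma^e_{J+1,0,0} \mid \Gamma^e_{1,0,0}) = \prod_{j=1}^{J} P(\Gamma^e_{j+1,0,0} \mid \Gamma^e_{j,0,0})$.

Using Lemmas 6.1 and 6.2 and the fact that $p_k(\alpha, \zeta) \ge p_K(\alpha, \zeta)$ [21, Lemma 35], we get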
P(r} +1 |ri )0 ,o) > C)^ J p(<5, C)' ?max ' 7 - Also, P(rf ) = 1. This follows by the assumption on Pq and Lemma 

E2 Thus, P(r e J+1A0 ) >PK(a 1 KJ P{aXY— J - 

Using the definitions of $\alpha_{\mathrm{add}}(\zeta)$ and $\alpha_{\mathrm{del}}(\zeta)$ and $\alpha \ge \alpha_{\mathrm{add}}$ and $\tilde{\alpha} \ge \alpha_{\mathrm{del}}$, $P(\Gamma^e_{J+1,0,0}) \ge p_K(\alpha, \zeta)^{KJ}\, \tilde{p}(\tilde{\alpha}, \zeta)^{\vartheta_{\max}J} \ge (1 - n^{-10})^2 \ge 1 - 2n^{-10}$.

The event $\Gamma^e_{J+1,0,0}$ implies that $\hat{T}_t = T_t$ and $e_t$ satisfies (8) for all $t < t_{J+1}$. Using Remark 5.10 and the third claim of Remark 6.3, $\Gamma^e_{J+1,0,0}$ implies that all the bounds on the subspace error hold. Using these, Remark 5.11, $\|a_{t,\mathrm{new}}\|_2 \le \sqrt{c}\,\gamma_{\mathrm{new},k}$ and $\|a_t\|_2 \le \sqrt{r}\,\gamma_*$, $\Gamma^e_{J+1,0,0}$ implies that all the bounds on $\|e_t\|_2$ hold (the bounds are obtained in Lemmas 7.2 and A.2).

Thus, all conclusions of the result hold w.p. at least $1 - 2n^{-10}$.
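
The final step of the probability chain in this proof is elementary. Spelled out as a one-line worked inequality (nothing here goes beyond the bound already stated):

```latex
\[
(1 - n^{-10})^2 \;=\; 1 - 2n^{-10} + n^{-20} \;\ge\; 1 - 2n^{-10}.
\]
```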

VII. Lemmas used to prove Lemma 6.2
In this section, we remove the subscript $j$ at most places. The convention of Remark 5.14 applies.

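A. Showing exact support recovery and getting an expression for $e_t$
Lemma 7.1 (Bounding the RIC of $\Phi_k$): The following hold.

1) $\delta_s(\Phi_0) = \kappa_s^2(\hat{P}_*) \le \kappa_{s,*}^2 + 2\zeta_*$

2) $\delta_s(\Phi_k) = \kappa_s^2([\hat{P}_*, \hat{P}_{\mathrm{new},k}]) \le \kappa_s^2(\hat{P}_*) + \kappa_s^2(\hat{P}_{\mathrm{new},k}) \le \kappa_{s,*}^2 + 2\zeta_* + (\kappa_{s,\mathrm{new}} + \tilde{\kappa}_{s,k}\zeta_k + \zeta_*)^2$ for $k = 1, 2, \dots, K$

Proof: The above lemma is the same as the last two claims of [21, Lemma 28]. It follows using Lemma 2.4 and some linear algebraic manipulations. ■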
Lemma 7.2 (Sparse recovery, support recovery and expression for et): Assume that the conditions of Theorem 14. 1 1 hold. 

1) For all k = 1, 2, ...'&+ 1, Xj t K,k-i G ^j,K,k-x implies that 

a) C* < C+ := K, < <<, UM/lh < (r + c)£, 

b) 5 a ($j<-) < 0.1479 and </> K < 0+ := 1.1735 

c) for any t G Z^fc, 

i) the projection noise /3 t := (7— P^i^PL^Lt satisfies \\/3t\\2 < \/C 

ii) the CS error satisfies ||ot >os — Stlh < 

iii) f t - Tt, 

iv) e t satisfies ([8]) and 1 1 1 1 2 < <fi + y/(. 

2) For all k = 1, 2, . . .1? + 1, P(T t = T t and e t satisfies © for all t G %, k \X jtK ,k-i) = 1 for all X ,. KJ , € r iiJf , fc _i. 

3) For all k = 1, 2, . . .1? + 1, P(T t = T t and e t satisfies © for all t G 2,-,* |r| K fc _ 1 ) = 1. 
Proof: 

Claim 1-a follows using Remark 1631 Claim 1-b) follows using claim 1-a) and Lemma [77X1 Claim 1-c) follows in a fashion 
similar to the proof of lETI Lemma 30]. The main difference is that everywhere we use ^xLt = Qi<Pj a t and ||3>#P,-||2 < 
(r + c)£. Claim 1-c-i) uses this and the fact that for t G ij,k< ®(t) — ®K> and \fQ < \Jytj(r + c) 3 . Claim 1-c-ii) uses c-i), 
VC < £ (defined in the theorem), #2 a ($.K") < 0.1479, and Theorem 11.131 Claim 1-c-iii) uses c-ii), the definition of p, the 
choice of cj and the lower bound on S m in given in the theorem. Claim 1-c-iv) uses claim c-iii) and Remark [5.111 To get the 
bound on ||et||2 we use the first expression of (|8}, 4>k < 4> + '■= 1.1735, and \/C < ^/7*/( r + c ) 3 - 

Claim 2) is just a rewrite of claim 1). Claim 3) follows from claim 2) by Lemma [T~4l 
B. A lemma needed for bounding the subspace error, $\tilde{\zeta}_k$

Lemma 7.3: Assume that Q k i < Ck'C f° r k' = 1, - ■ ■ , k — 1. Then 

1) HAteVslla = ||*fc-iG de ,,fe||2 < rC 

2) ||Gdet,fcGdet,fc' — Gdet,fcG detifc ||2 < 2r£. 

3) < y/l - r 2 C 2 < o-i(Dfc) = ^(i? fe ) < 1. Thus, ||D fc || 2 = \\R k \\ 2 < 1 and \\D^% = \\R k % < 

4) H-Dundetfe'-Efclb = HGundet./c'-Efclh < ~ 7=^=5- 

Proof: The first claim essentially follows by using the fact that $\hat{G}_1, \dots, \hat{G}_{k-1}$ are mutually orthonormal and the triangle inequality. Recall that $\Psi_{k-1} = (I - \hat{G}_{\mathrm{det},k}\hat{G}_{\mathrm{det},k}')$. The last three claims use this and the first claim and apply Lemma 1.12.

The last claim also uses the definition of $D_k$ and its QR decomposition. The complete proof is given in Appendix B. ■



C. Bounding the subspace error, $\tilde{\zeta}_k$
Lemma 7.4 (Bounding Ck + )■' If 

r i~ 7 \ fine (<?max j ''max) „ , m 
/dec(.flmax, "max) = J > (9) 

then fdec(h,hk) > and < c k (. 

Proof: Recall that / !nc (.), fdec(-) are defined in Definition 15 . 3 1 and ( k + :— f* nc (9,h) ^ jSTotice that /j nc (.) is a non-decreasing 
function of g, h, and fdec(-) is a non-increasing function. Using the definition of <? max , fr max , c lrml given in Assumption 12.51 
the result follows. ■ 

Remark 7.5: If we ignore the small terms of fi nc (.) and fdec(-), the above condition simplifies to requiring that 
3«„, e gmax+K s , e _» ^ frpj- ■ Since 5 max > 1, the first term of the numerator is the largest one. To ensure that this 
condition holds we need K+ e to be very small. However, as explained in Sec IVII-DI if we also assume denseness of 
D k , i.e. if we assume n s (D k ) < kT d for a small enough k~T d , then the first term of the numerator can be replaced by 
max(3K,f e K~^ D 4> + g max , K,f e (f> + h max ) . This will relax the requirement on K+ e , e.g. now nf e = D = 0.3 will work. 

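Lemma 7.6 (Bounding $\tilde{\zeta}_k$): If $\lambda_{\min}(A_k) - \lambda_{\max}(A_{k,\perp}) - \|\mathcal{H}_k\|_2 > 0$, then

$\tilde{\zeta}_k \le \dfrac{\|\mathcal{H}_k\|_2}{\lambda_{\min}(A_k) - \lambda_{\max}(A_{k,\perp}) - \|\mathcal{H}_k\|_2}$ (10)

Proof: Recall that $A_k$, $A_{k,\perp}$, $\mathcal{H}_k$ are defined in Definition 5.6. The result follows by using the fact that $\tilde{\zeta}_k = \|(I - \hat{G}_k\hat{G}_k')D_k\|_2 = \|(I - \hat{G}_k\hat{G}_k')E_kR_k\|_2 \le \|(I - \hat{G}_k\hat{G}_k')E_k\|_2$ and applying Lemma 1.11 with $E = E_k$ and $F = \hat{G}_k$. ■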
Lemma 7.7 (High probability bounds for each of the terms in the C, k bound and for C, k ): Assume that the conditions of 
Theorem |4H hold. Also, assume that P{T e j K > 0. Then, for all 1 < k < 

1) P(A min (i fc ) > Afc (1 - r 2 C 2 - O.lOlr^fc^) > 1 - pi (5,0 with pi (5,0 given in ®. 

2) P(A max (i fe , ± ) < \ k (h k +r 2 C 2 / + 0.lO|r^ fc _ 1 ) > l-p 2 (a,0 with p 2 {a,Q given in O- 

3) P(||^fc||a < Kfino{9kM) \^l K ,k-x) > 1 -P 3 (5,C) with p 3 (a,0 8 iven in <ES1>- 

4) P(A min (i fc ) - A max (i fe ,x) - ||W fc || 2 > X k fdec(9kM) \Tl K ,k-i) >P(«,0 : = 1 - - Pa(&, - Pa(5, 0- 

5) If / dec (ff fc , &*) > 0, then P(d < C fc + l r W-i) > P(5. 

Proof: Recall that fi nc (.), fdec(-) an d are defined in Definition 15.31 The proof of the first three claims is given in Sec 
IVII-DI The fourth claim follows directly from the first three using the union bound on probabilities. The fifth claim follows 
from the fourth using Lemma 17.61 ■ 
Lemma 7.8 (High probability bound on $\tilde{\zeta}_k$): Assume that the conditions of Theorem 4.1 hold. Then,

$P(\tilde{\zeta}_k \le \tilde{c}_k\zeta \mid \Gamma^e_{j,K,k-1}) \ge \tilde{p}(\tilde{\alpha}, \zeta)$

Proof: This follows by combining Lemma 7.4 and the last claim of Lemma 7.7. ■
D. Proof of Lemma 7.7

Proof: We use $\frac{1}{\tilde{\alpha}}\sum_t$ to denote $\frac{1}{\tilde{\alpha}}\sum_{t \in \tilde{\mathcal{I}}_{j,k}}$.
For $t \in \tilde{\mathcal{I}}_{j,k}$, let $a_{t,k} := G_{j,k}'L_t$, $a_{t,\mathrm{det}} := G_{\mathrm{det},k}'L_t = [G_{j,1}, \cdots, G_{j,k-1}]'L_t$ and $a_{t,\mathrm{undet}} := G_{\mathrm{undet},k}'L_t = [G_{j,k+1} \cdots G_{j,\vartheta_j}]'L_t$. Then $a_t := P_j'L_t$ can be split as $a_t = [a_{t,\mathrm{det}}'\ a_{t,k}'\ a_{t,\mathrm{undet}}']'$.
This lemma follows using the following facts and the Hoeffding corollaries, Corollaries 1.6 and 1.7.

1) The statement "conditioned on r.v. $X$, the event $\mathcal{E}^e$ holds w.p. one for all $X \in \Gamma$" is equivalent to "$P(\mathcal{E}^e \mid X) = 1$, for all $X \in \Gamma$". We often use the former statement in our proofs since it is often easier to interpret.

2) The matrices D k , R k , E k , Aiet,fc, -Dundet.fc, ^fc-i, $k are functions of the r.v. Xj,K.k-i- All terms that we bound for 
the first two claims of the lemma are of the form i Yltex k where Z t = /i(^j,K,fc-i)^/2p^i,j<",fc-i)> Y t is a sub- 
matrix of a t a' t and /i(.) and f 2 (.) are functions of X,jK,k-i- For instance, one of the terms while bounding \ m i n (Ak) 

3) Xj } K ,k-i is independent of any a t for t <E Ij ik , and hence the same is true for the matrices D k , Rk, E k , £>det,fc> Amdet.fc, 
^k-i> $k- Also, at's for different t £ Ij ik are mutually independent. Thus, conditioned on Xj t K,k-i> the ^t' s defined 
above are mutually independent. 

4) All the terms that we bound for the third claim contain et- Using the second claim of Lemma 17721 conditioned on 
Xj t K,k-i, et satisfies © w.p. one whenever Xj t x,k-i € ^j,K,k-i- Conditioned on Xj t K,k-i, all these terms are also of 
the form i Yltei ■ k ^ t w * trl ^* as defined above, whenever Xj ^ k-i € ^j,K,k-i- Thus, conditioned on Xj K,k-i, the 
Z t 's for these terms are mutually independent, whenever Xj t K t k-i G ^j,K,k-i- 

5) By Remark|631 -Xj )K) fc-i e i implies that (.* < r(, Ck' < Cfc'C, for all k' = 1, 2, . . . k - 1, ( K < < C C> ( iv ) 
0if < + (by Lemma 1772] ); (v) H^if-Pjl^ < (r + c)(; and (vi) all conclusions of Lemma 177731 hold. 

6) By the clustering assumption, < A min (E(a t , fe a t , fe ')) < A max (E(a t , fe a t , fe ')) < A^; A max (E( 

fli,det a t,det 

)) < A+ = A+; 

and A max (E( 

"■t , undet Q>t , undet max 

(E(a t aJ)) < A + . 

7) By Weyl's theorem, for a sequence of matrices $B_t$, $\lambda_{\min}(\sum_t B_t) \ge \sum_t \lambda_{\min}(B_t)$ and $\lambda_{\max}(\sum_t B_t) \le \sum_t \lambda_{\max}(B_t)$.
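
The Weyl-type eigenvalue bounds in the last fact are easy to sanity-check numerically. The sketch below is only an illustration on random symmetric matrices; the helper names `weyl_sum_bounds_hold` and `random_symmetric` are hypothetical and not part of the paper.

```python
import numpy as np

def weyl_sum_bounds_hold(mats, tol=1e-9):
    """Check lambda_min(sum B_t) >= sum lambda_min(B_t) and
    lambda_max(sum B_t) <= sum lambda_max(B_t) for symmetric B_t."""
    total = sum(mats)
    eig_total = np.linalg.eigvalsh(total)        # eigenvalues in ascending order
    eigs = [np.linalg.eigvalsh(B) for B in mats]
    sum_min = sum(e[0] for e in eigs)
    sum_max = sum(e[-1] for e in eigs)
    return bool(eig_total[0] >= sum_min - tol and eig_total[-1] <= sum_max + tol)

def random_symmetric(n, rng):
    """Symmetrize a Gaussian matrix (illustrative test input only)."""
    A = rng.standard_normal((n, n))
    return (A + A.T) / 2

rng = np.random.default_rng(0)
mats = [random_symmetric(5, rng) for _ in range(20)]
print(weyl_sum_bounds_hold(mats))
```

In the proof these bounds are applied to sums of the matrices $Z_t$ and $Y_t$ so that each piece can be bounded separately before the Hoeffding corollaries are invoked.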
Consider A k = i £ t E k '^ k -iL t Lt^ k -iE k . Notice that E k '^ k ^L t = R k a t . k + E k \D det 

.fcflt.det + i'undet.fcQ't, undet)- Let 

Z t = R k a tyk at,k' Rk' and let F t = i?^a t ^(a^det'Ajet.fc' + a^undet'Amdet.s/)-^ + -Efc(Aiet,S: a i, det + Amdet,fc a t,undet)atVi?fc'. Then 

4^V2 f + -Vy f (ii) 

t t 

Consider \^ t Z t = |^] t R k at :k at. k ' Rk ■ (a) As explained above, the Z t 's are conditionally independent given 
Xj,K,k-i- (b) Using Ostrowoski's theorem and Lemma [773] for all X jtK ,k-i € r^fe-i, A min (E(i £) ( Z t \Xj, K ,k-i)) = 
X min (R k ^ t -E(a ttkat ,k')Rk') > X min (R k R k ')X min (l J2t E(a t , fe a t , fc ')) > (1 -r 2 C 2 )A". (c) Finally, using ||JJ fc || a < 1 and 
1 1 «t,fc 1 1 2 < Vc^7*, conditioned on X jtK ,k-i, X Z t X c fc 7 2 / holds w.p. one for all X jtK ,k-i e r^K.fc-i- 

Thus, applying Corollary II .61 with e = 0.1£A~, and using c k < r, for all Xj t K,k-i € Tj^^-i, 



:| E Stf > (1 - ^ 2 C 2 )\T - O.ia .V, K , > 1 - c k exp(-^|^) > 1 - rexp(- 5 ' \ 



Consider Y t = R k at,k{ a t,det' D dettk ' + a ttimAe / D undettk ')E k + E k (D Ae ^ k a t: d et + D undettk a ttUnAet )at,k' R k ' ■ (a) As before, the Y t 's 
are conditionally independent given Xj y K,k-i- (b) Since E[a t ] = and Cov[a f ] = A f is diagonal, E(^ J^t Yt\Xj ,K.k-i) = 
whenever X^ K>k _ x E T^k^-i- (c) Conditioned on Xj.x.fc-i, ||y t || a < 2^ 7 2 rC(l + ^== ) < ^ 2 (^ (1 + ^"^.J < 
2(1 + iQ" ) < 2.1 holds w.p. one for all X~ k k-i ^ k k-i- This follows because X,< k k-i ^ T 7 ^ / c _ 1 implies that 
||Afet,fc||2 < ||^fe'-D U ndet,fe||2 = 1 1 ^fc'G un det,fc 1 1 2 < 7= C = ■ Thus, under the same conditioning, -bl < Y t ^ bl with 
b = 2.1 w.p. one. Thus, applying Corollary 11.61 with e = 0.1CA~, we get 

P(A min (4 Vy t ) > -0.1CX-\X jtK , k -i) > l-rcxp(- Q( °: 1CA ) ) for all A,- K , fc -i € I , k> : (13) 
a ^ 8(4.2) 2 
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Combining (fTTT) . (ITDl and (O and using the union bound, P(A m i n (yl/ c ) ^ A^, (1 — r^£^) — 0.2£A t K,k— 1) ^1 — 
Pi(5,C) f or all Xj !K!k -i G r 3 ,K,fe-i where 

- /- ^ / a-(O.lCA-) 2 , , 5(0.1CA-) 2 , 

Pl (a,C) := rcM ^T^) + ^xp(- ) (14) 

The first claim of the lemma follows by using A,7 > A - and applying Lemma [T~4l with X = Xj^k—i and C = r^jf.fc-i- 
Consider A ki± := jEt V**-iWVA,i. Notice that E kj ±^ k -iL t = E K± '(D del 

.kO-t ,det + -^undet, fe &i .undet ) ■ 

Thus, Afc,j_ = i^t^t with Z 4 = £fc,j_'(-Ddet,fcat,det + Aindet,fcat,undet)(Aiet,fcat,det + £> vmdet)fe at )Un det) / Sfc,x which is of size 
(n — Cfc) x (n — c k ), (a) As before, given Xj y K,k-i, the Z t 's are independent, (b) Conditioned on Xj t K,k-i> ^ Z t < tj^I 
wp. one for all A ;Kf , , G r,., v . A ,. (c) E(± £ 4 Z t |X,- d (Aj£ +1 + r 2 C 2 A+)7 for all \,. lxJ , G 1 ,. 
Thus applying Corollary 1 1 . 6 1 with e = 0.1£A~ and using c k > c m i n , we get 

P(A maj£ (i fe ,±) < A+ +1 +r 2 C 2 A+ + 0.1CA-|X,-^-i) > l-p 2 (a,C) for all X, 

,K,k-l t 1 j.K.k-1 

where 

- ^ / \ , a(o.icA-) 2 ^ 

M a > C) : = {n - Cmin) exp( ^T~i — ) ( 15 ) 



The second claim follows using A fe > A , / := A + /A , h k := X k +i + /\ k in the above expression and applying Lemma [T~4l 
Consider the third claim. Using the expression for T~L k given in Definition 15.61 it is easy to see that 

\\iikh <max{||ff fc || 2 , \\H k ,±_h} + \\B k \\ 2 < \\l Ve t e t '|| 2 + max(||T2|| 2 , ||T4|| 2 ) + ||B fc || a (16) 

rv * — » 



a 
t 



where T2 := &£ t Efc'tf fc _i(W + e t L/)*fc-i£ fc and T4 := i £ t £fc,x'*fe-i(W + e t 'L t )*fc-i£k,±- The second 
inequality follows by using the facts that (i) H k =Tl- T2 where Tl := A J2 t E k ^ k -ie t e t '^ k -xE k , (ii) H ky± = T3 - T4 
where T3 := A £ t f; fc ,±'* fe -ie t e t '* fe _ii; fe , ± , and (iii) max(||Tl|| 2 , ||T3|| 2 ) < || A £ t e t e t '\\ 2 . 

Next, we obtain high probability bounds on each of the terms on the RHS of (16) using the Hoeffding corollaries.

Consider || A J^t e t^t\ 2 - Let Z t = e t e t '. (a) As explained in the beginning of the proof, conditioned on Xj t K,k-i, the 
various Zt's in the summation are independent whenever Xj K,k-i G ^j,K,k-i- (b) Conditioned on Xj K,k-it ^ Z t X ^7 
w.p. one for all .V,. K ., , G r, K> :• Here 6i := 0+ 2 C. (c) Using H^P^ < (r + c)C, ^ ^ Et E (^l^,^,fc-i) ^ 
6 2 7, 6 2 := (r + c) 2 C 2 </>+ 2 A+ for all X,-, x )fc _i G r^^fe-i. 

Thus, applying Corollary 11.61 with e = 0.1£A _ , 

P(l|4 ^etet'h < &2 + 0.1CA-|X J - if , fe -i) > 1 - nexp(-^H^-^) for all X,- jJf|fc _i € r„ K ,fc-i (17) 

Consider T2. Let Z t := E k '^ k -x{L t et' + e t L t ')^ k _ x E k which is of size c k x c k . Then T2 = A£ t Z t . ( a ) 
Conditioned on Xj^K,k-u the various Zt's used in the summation are mutually independent whenever Xj t jc,k-i G 
rj,K,fe-i- (b) Notice that E k 'V k -iL t = R k a ttk + E k {D det . k a t .M + D miettk a t , m det) and E k '^ k -ie t = (R^ 1 )' D' k e t = 
{R k l )'D' k I T A( ( $'K)'T t {®K)T t \~ 1 lT t ''$>KPjat- Thus conditioned on Xj 

,K,k-i, \\Zt\\2 < 26 3 w.p. one for all Xj t K,k—i G 
Here, 6 3 := ^ + 7*- This follows using UR^Yh < l/^l-^C, \\e t \\ 2 < and \\E' k ^ k ^L t \\ 2 < 



». HI I 

Thus, applying Corollary II. 71 with e = 0.1£A~, for all Xj t K,k-i G r^^fc-i, 



|A|| 2 < V^7*- (c) Also, ||iE*E(^|A> )JC)fc _i)|| 2 < 2b 4 where & 4 := K a , e (r + c)C0+(A+ + <A+ + -0==\+ +1 ). 



P(||T2|| 2 <26 4 + 0.1CA-|X i>ir>fe _ 1 ) > l-c fc exp(- a( " 2 1C 4 A ^ 



Consider T4. Let Z t := E ktX '^ k _i(L t e t ' + etL t ')^ k _ x E k ^ which is of size (n~c k ) x (n — c k ). Then T4 = i E t ^t- (a) 
conditioned on Xj % k-it the various Z t 's used in the summation are mutually independent whenever Xj K,k-i G Tj ^ ,fc_i. 
(b) Notice that Ek^'^k^Lt = E k ^±' (D d ^ k a ttdet + -D undetifc a tiUnd et). Thus, conditioned on Xj,K,k-i, \\Zth < 26 5 w.p. one 
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P(||T4|| 2 < 2b 4 + 0.1CA-IA,- > 1 - (n - c fc ) exp( \ 2 ^' ) 



^2 



P(max(||T2|| 2 , ||T4|| 2 ) < 2& 4 + 0.1CA- |X,- > 1 - nexp( ^ ) (18) 



for all X, hi , i G r i)Kjfc _i. Here 6 5 := \/r<V> + 7*- ( c ) Also > for a11 ^-j,K,k-\ G r^K.fc-i, II s Et E (-^t|^i,^,fc-i)l|2 < 
2& 6 , 6 6 := K s . e (r + c)C0 + (A,t +1 + r(X + ). Applying Corollary 11.71 with e = 0.1CA~, for all Xj t K, k -i e Fj,K,k-i> 

P(||T4|| 2 < 26 6 + 0.1CA-|X^, fc -i) > 1 - (n - gfc) exp(- ^^ ) > 1 - (n - c min ) exp(- ^^ ) 

Consider max(||T2|| 2 , ||T4|| 2 ). Since 63 = 65 and 64 > £> 6 , so 2& 6 + e < 26 4 + e. Therefore, for all XjK k _ x G Ij ^ 

<5(0.1CA_) 2 

By union bound, for all X^K,k-i € r^x.A-i, 

a(O.lCA-) 

Notice that if we also introduce an extra denseness coefficient k s ,d '■= niaxj max/c K s (D k ), then P(||T2|| 2 < 2k s .£>&4 + 
O.lCA-lXj.jc.k-i) > l-g fc exp(- a(0 32 ic 4 ^ )2 ). Thus, P(max(||T2|| 2 , ||T4|| 2 ) < 2max{ Ks , D b 4 , b 6 ) + O.K>r\X jt K ,k-i) > 
1 — ncxp(- °^ 4 ^ ). This would help to get a looser bounds on g max and h max in Theorem 14. II 

Consider ||J5 fc || 2 . Let Z t := E k ,±'^k-i{L t - e t ){L t ' - e t ')^ k -iE k which is of size (n-c k ) x c k . Then B k = i E t Z t- ( a ) 
conditioned on Xj : K,k-i> the various Z t 's used in the summation are mutually independent whenever X^K,k-i G Tj t K,k-i- 
(b) Notice that E k ,x^k-t(L t - e t ) = E kt x (D^ %k a t ^ t + D anAettk a t ,nnd e t ~ *fc-ie t ) and E k '^ k -i(L t - e t ) = R k a t , k + 
Sfe'(£> detlfe a i)de t + -D U ndet,feat,undet-*fe-iet). Thus, conditioned on Xj tK ,k-i> \\ z th < b 7 w.p. one for all X^.fe-i G r^fe-i- 
Here & 7 := (^7* + ^ + V() 2 - (c) || i Et E(Z t |X i)jr ,fc-i)||2 < &s for all X^.fc-i G r JiJr , fc _i where 

b 8 := (r + c)C^. e 0+A+ + [(r + c)(k s ^+ + (r + c)<X, e Z-L— =]A+ +1 [r 2 ( 2 + 2(r + cK 2 K s , e </> + + (r + C ) 2 C 2 4^ +2 ]A + 

V 1 — 

Thus, applying Corollary 1 1 . 7 1 with e = 0.1£A~, 

P(||S fc || 2 < b s + 0.1. A .V ; K / . ,i > 1 - nexp(- Q( ° 2 1C ^ ) ) for all X jtKtk - X G IW-i (19) 
Using (H6), G3, CI and dT9j and the union bound, for any Xj^K,k—i £ r^icfe— 1» 

P(||^fc||2<69 + 0.2CA-|A- iiJfjfc _i) >l-p 3 (5,0 

where 69 := 6 2 + 264 + 6§ and 

~ 9 - 9 ~ 9 

p 3 (a,C) riexp(-^-^) +Tiexp(~ 32 +nexp(-^-^) (20) 

with 64 = 0+ 2 C, 63 := VK<f> + l*, b 7 := (^7* + 0+VC) 2 - Using A" > A", / := A+/A", g k := A+/A~ and h k := A+ +1 /A^, 
and then applying Lemma IT~4l the third claim of the lemma follows. ■ 

VIII. Simulation experiments 

1) Data Generation: The simulated data is generated as follows. The measurement matrix $\mathcal{M}_t := [M_1, M_2, \cdots, M_t]$ is of size 2048 × 5200. It can be decomposed as a sparse matrix $\mathcal{S}_t := [S_1, S_2, \cdots, S_t]$ plus a low rank matrix $\mathcal{L}_t := [L_1, L_2, \cdots, L_t]$.

The sparse matrix $\mathcal{S}_t := [S_1, S_2, \cdots, S_t]$ is generated as follows. For $1 \le t \le t_{\mathrm{train}} = 200$, $S_t = 0$. For $t_{\mathrm{train}} < t \le 5200$, $S_t$ has $s$ nonzero elements. The initial support is $T_0 = \{1, 2, \dots, s\}$. Every $\Delta$ time instants we increment the support indices by 1. For example, for $t \in [t_{\mathrm{train}} + 1, t_{\mathrm{train}} + \Delta - 1]$, $T_t = T_0$, for $t \in [t_{\mathrm{train}} + \Delta, t_{\mathrm{train}} + 2\Delta - 1]$, $T_t = \{2, 3, \dots, s + 1\}$ and so on. Thus, the support set changes in a highly correlated fashion over time and this results in the matrix $\mathcal{S}_t$ being low rank. The larger the value of $\Delta$, the smaller will be the rank of $\mathcal{S}_t$ (for $t > t_{\mathrm{train}} + \Delta$). The signs of the nonzero elements of $S_t$ are ±1 with equal probability and the magnitudes are uniformly distributed between 2 and 3. Thus, $S_{\min} = 2$.
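
The sparse-sequence model just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' simulation code: the function names are hypothetical, indices are 1-based in `support_at` to match the text, and the small sizes in the usage lines stand in for n = 2048 and 5200 frames.

```python
import numpy as np

def support_at(t, s, delta, t_train):
    """1-based support T_t for t > t_train: {1..s} shifted by one index every delta instants."""
    shift = (t - t_train) // delta
    return np.arange(1 + shift, s + 1 + shift)

def generate_sparse(n, t_max, s, delta, t_train, rng):
    """Return the n x t_max matrix [S_1, ..., S_{t_max}]; column t-1 holds S_t."""
    if s + (t_max - t_train) // delta > n:
        raise ValueError("support would shift past the last row")
    S = np.zeros((n, t_max))
    for t in range(t_train + 1, t_max + 1):
        rows = support_at(t, s, delta, t_train) - 1   # convert to 0-based rows
        signs = rng.choice([-1.0, 1.0], size=s)       # +1 or -1 with equal probability
        mags = rng.uniform(2.0, 3.0, size=s)          # magnitudes between 2 and 3
        S[rows, t - 1] = signs * mags
    return S

# Small illustrative sizes (the experiments use n = 2048 and 5200 frames).
rng = np.random.default_rng(0)
S = generate_sparse(n=64, t_max=60, s=5, delta=10, t_train=20, rng=rng)
print(np.linalg.matrix_rank(S))
```

Printing the rank for a few values of `delta` shows the effect noted in the text: the slower the support moves, the lower the rank of the sparse matrix.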

The low rank matrix $\mathcal{L}_t := [L_1, L_2, \cdots, L_t]$ where $L_t := P_{(t)}a_t$ is generated as follows: There are a total of $J = 2$ subspace change times, $t_1 = 301$ and $t_2 = 2501$. $r_0 = 36$, $c_{1,\mathrm{new}} = c_{2,\mathrm{new}} = 1$ and $c_{1,\mathrm{old}} = c_{2,\mathrm{old}} = 3$. Let $U$ be a $2048 \times (r_0 + c_{1,\mathrm{new}} + c_{2,\mathrm{new}})$ orthonormalized random Gaussian matrix. For $1 \le t \le t_1 - 1$, $P_{(t)} = P_0$ has rank $r_0$ with $P_0 = U_{[1,2,\cdots,36]}$. For $t_1 \le t \le t_2 - 1$, $P_{(t)} = P_1 = [P_0 \setminus P_{1,\mathrm{old}}\ \ P_{1,\mathrm{new}}]$ has rank $r_1 = r_0 + c_{1,\mathrm{new}} - c_{1,\mathrm{old}} = 34$ with $P_{1,\mathrm{new}} = U_{[37]}$ and $P_{1,\mathrm{old}} = U_{[9,18,36]}$. For $t \ge t_2$, $P_{(t)} = P_2 = [P_1 \setminus P_{2,\mathrm{old}}\ \ P_{2,\mathrm{new}}]$ has rank $r_2 = r_1 + c_{2,\mathrm{new}} - c_{2,\mathrm{old}} = 32$ with $P_{2,\mathrm{new}} = U_{[38]}$ and $P_{2,\mathrm{old}} = U_{[8,17,35]}$. $a_t$ is independent over $t$. The various $(a_t)_i$'s are also mutually independent for different $i$. For $1 \le t < t_1$, we let $(a_t)_i$ be uniformly distributed between $-\gamma_{i,t}$ and $\gamma_{i,t}$, where

$\gamma_{i,t} = \begin{cases} 400 & \text{if } i = 1, 2, \cdots, 9, \ \forall t, \\ 30 & \text{if } i = 10, 11, \cdots, 18, \ \forall t, \\ 2 & \text{if } i = 19, 20, \cdots, 27, \ \forall t, \\ 1 & \text{if } i = 28, 29, \cdots, 36, \ \forall t. \end{cases}$ (21)
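
As a rough illustration of the first segment of this model ($1 \le t < t_1$), the sketch below draws an orthonormalized Gaussian basis and coefficients with the amplitude bounds of (21). The function names are hypothetical, and the ambient dimension is reduced to 256 (instead of 2048) purely to keep the example quick to run.

```python
import numpy as np

def orthonormal_basis(n, r, rng):
    """Orthonormalize an n x r Gaussian matrix via reduced QR."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return Q

def gamma_initial(r0=36):
    """Amplitude bounds of (21): 400, 30, 2, 1 on four consecutive blocks of r0/4 indices."""
    return np.repeat([400.0, 30.0, 2.0, 1.0], r0 // 4)

def generate_initial_lowrank(U, t_count, rng, r0=36):
    """L_t = P_0 a_t with (a_t)_i ~ Uniform(-gamma_i, gamma_i), independent over i and t."""
    P0 = U[:, :r0]
    gam = gamma_initial(r0)
    A = rng.uniform(-1.0, 1.0, size=(r0, t_count)) * gam[:, None]
    return P0 @ A, A

rng = np.random.default_rng(1)
U = orthonormal_basis(n=256, r=38, rng=rng)   # 38 = r0 + c_{1,new} + c_{2,new}
L, A = generate_initial_lowrank(U, t_count=300, rng=rng)
```

The later segments would reuse the same `U`, dropping the "old" columns and appending the "new" one, with the amplitude bounds switched to those of the corresponding equations.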



For $t_1 \le t < t_2$, $a_{t,*}$ is an $r_0 - c_{1,\mathrm{old}}$ length vector, $a_{t,\mathrm{new}}$ is a $c_{1,\mathrm{new}}$ length vector and $L_t := P_{(t)}a_t = P_1 a_t = (P_0 \setminus P_{1,\mathrm{old}})a_{t,*,nz} + P_{1,\mathrm{new}}a_{t,\mathrm{new}}$. Now, $(a_{t,*,nz})_i$ is uniformly distributed between $-\gamma_{i,t}$ and $\gamma_{i,t}$ for $i = 1, 2, \cdots, 33$ and $a_{t,\mathrm{new}}$ is uniformly distributed between $-\gamma_{\mathrm{new},t}$ and $\gamma_{\mathrm{new},t}$, where

$\gamma_{i,t} = \begin{cases} 400 & \text{if } i = 1, 2, \cdots, 8, \ \forall t, \\ 30 & \text{if } i = 9, 10, \cdots, 16, \ \forall t, \\ 2 & \text{if } i = 17, 18, \cdots, 24, \ \forall t, \\ 1 & \text{if } i = 25, 26, \cdots, 33, \ \forall t, \end{cases}$

$\gamma_{\mathrm{new},t} = \begin{cases} 1.1^{k-1} & \text{if } t_1 + (k-1)\alpha \le t \le t_1 + k\alpha - 1, \ k = 1, 2, 3, 4, \\ 1.1^{4-1} = 1.331 & \text{if } t \ge t_1 + 4\alpha. \end{cases}$ (22)



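For $t \ge t_2$, $a_{t,*}$ is an $r_1 - c_{2,\mathrm{old}}$ length vector, $a_{t,\mathrm{new}}$ is a $c_{2,\mathrm{new}}$ length vector and $L_t := P_{(t)}a_t = P_2 a_t = (P_1 \setminus P_{2,\mathrm{old}})a_{t,*} + P_{2,\mathrm{new}}a_{t,\mathrm{new}}$. Also, $(a_{t,*})_i$ is uniformly distributed between $-\gamma_{i,t}$ and $\gamma_{i,t}$ for $i = 1, 2, \cdots, r_1 - c_{2,\mathrm{old}}$ and $a_{t,\mathrm{new}}$ is uniformly distributed between $-\gamma_{\mathrm{new},t}$ and $\gamma_{\mathrm{new},t}$, where

$\gamma_{i,t} = \begin{cases} 400 & \text{if } i = 1, 2, \cdots, 7, \ \forall t, \\ 30 & \text{if } i = 8, 9, \cdots, 14, \ \forall t, \\ 2 & \text{if } i = 15, 16, \cdots, 21, \ \forall t, \\ 1.331 & \text{if } i = 22, \ \forall t, \\ 1 & \text{if } i = 23, 24, \cdots, 31, \ \forall t, \end{cases}$ (23)

$\gamma_{\mathrm{new},t} = \begin{cases} 1.1^{k-1} & \text{if } t_2 + (k-1)\alpha \le t \le t_2 + k\alpha - 1, \ k = 1, 2, \cdots, 7, \\ 1.1^{7-1} = 1.7716 & \text{if } t \ge t_2 + 7\alpha. \end{cases}$ (24)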
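
The piecewise schedule for the amplitude of the newly added direction is simple to express in code. The sketch below is an illustrative helper (the name `gamma_new` and its argument names are assumptions); it takes the change time, the interval length $\alpha$, and the number of growth intervals, and returns $1.1^{k-1}$ on the $k$-th interval, held constant afterwards.

```python
def gamma_new(t, t_j, alpha, k_max):
    """Amplitude bound for the new direction at time t >= t_j:
    1.1**(k-1) on the k-th length-alpha interval after t_j, constant after k_max intervals."""
    if t < t_j:
        raise ValueError("t must be at least t_j")
    k = (t - t_j) // alpha + 1          # 1-based index of the interval containing t
    return 1.1 ** (min(k, k_max) - 1)

# With alpha = 100 (the value used in the experiments), after the first change time:
print(gamma_new(301, 301, 100, 4), gamma_new(401, 301, 100, 4), gamma_new(900, 301, 100, 4))
```

With `k_max=4` the value saturates at $1.1^3$, and with `k_max=7` at $1.1^6$, matching the two constants quoted in the text.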



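Thus for the above model, $S_{\min} = 2$, $\gamma_* = 400$, $\gamma_{\mathrm{new}} = 1$, $\lambda^+ = 53333$, $\lambda^- = 0.3333$ and $f := \lambda^+/\lambda^- = 1.6 \times 10^5$. One way to get the clusters is as follows.

1) For $t_1 \le t < t_2$ with $j = 1$, let $\mathcal{G}_{1,(1)} = \{1, 2, \cdots, 8\}$, $\mathcal{G}_{1,(2)} = \{9, 10, \cdots, 16\}$ and $\mathcal{G}_{1,(3)} = \{17, 18, \cdots, 34\}$. Thus, $\tilde{c}_{1,1} = \tilde{c}_{1,2} = 8$, $\tilde{c}_{1,3} = 18$, $\tilde{g}_{j,1} = \tilde{g}_{j,2} = 1$, $\tilde{g}_{j,3} = 4$, $\tilde{h}_{j,1} = 0.0056$, $\tilde{h}_{j,2} = 0.0044$.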

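2) For $t \ge t_2$ with $j = 2$, let $\mathcal{G}_{2,(1)} = \{1, 2, \cdots, 7\}$, $\mathcal{G}_{2,(2)} = \{8, 9, \cdots, 14\}$ and $\mathcal{G}_{2,(3)} = \{15, 16, \cdots, 32\}$. Thus, $\tilde{c}_{2,1} = \tilde{c}_{2,2} = 7$, $\tilde{c}_{2,3} = 18$, $\tilde{g}_{j,1} = \tilde{g}_{j,2} = 1$, $\tilde{g}_{j,3} = 4$, $\tilde{h}_{j,1} = 0.0056$, $\tilde{h}_{j,2} = 0.0044$.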

3) Therefore, $\tilde{g}_{\max} = 4$, $\tilde{h}_{\max} = 0.0056$ and $\tilde{c}_{\min} = 7$.
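
The numbers quoted above can be reproduced from the amplitude bounds, using the fact that a variable uniform on $[-\gamma, \gamma]$ has variance $\gamma^2/3$. The sketch below assumes, as the quoted values suggest, that the per-cluster ratio $\tilde{g}$ is the largest variance in a cluster divided by the smallest, and that $\tilde{h}$ is the largest variance of the next cluster divided by the smallest variance of the current one. The function name is hypothetical.

```python
import numpy as np

def cluster_parameters(gammas_by_cluster):
    """Given amplitude bounds per cluster (clusters ordered from largest to smallest
    variance), return lambda^+, lambda^-, f, and the per-cluster g and h ratios."""
    # Variance of Uniform(-gamma, gamma) is gamma**2 / 3.
    var = [np.asarray(g, dtype=float) ** 2 / 3.0 for g in gammas_by_cluster]
    lam_max = [v.max() for v in var]
    lam_min = [v.min() for v in var]
    g = [hi / lo for hi, lo in zip(lam_max, lam_min)]
    h = [lam_max[k + 1] / lam_min[k] for k in range(len(var) - 1)]
    lam_plus, lam_minus = max(lam_max), min(lam_min)
    return lam_plus, lam_minus, lam_plus / lam_minus, g, h

# j = 1: eight directions at 400, eight at 30, and a third cluster holding the
# directions at 2 and 1 plus the new direction (its largest amplitude, 1.331, is used).
clusters_j1 = [[400] * 8, [30] * 8, [2] * 8 + [1] * 9 + [1.331]]
lam_plus, lam_minus, f, g, h = cluster_parameters(clusters_j1)
print(lam_plus, lam_minus, f, g, h)
```

The two small values of $\tilde{h}$ reflect the wide gaps between consecutive clusters, which is the regime in which the clustering assumption is meant to apply.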

We used $\mathcal{L}_{t_{\mathrm{train}}} + \mathcal{N}_{t_{\mathrm{train}}}$ as the training sequence to estimate $\hat{P}_0$. Here $\mathcal{N}_{t_{\mathrm{train}}} = [N_1, N_2, \cdots, N_{t_{\mathrm{train}}}]$ is i.i.d. random noise with each $(N_t)_i$ uniformly distributed between $-10^{-3}$ and $10^{-3}$. This is done to ensure that $\mathrm{span}(\hat{P}_0) \ne \mathrm{span}(P_0)$ but only approximates it.
Fig. 4. $r_0 = 36$, $s = \max_t |T_t| = 20$ and $\Delta = 10$. The times at which PCP is done are marked by red triangles in (b).
Fig. 5. $r_0 = 36$, $s = \max_t |T_t| = 20$ and $\Delta = 50$. The times at which PCP is done are marked by red triangles in (b).



2) Results: For Fig. 4 and Fig. 5, we used $s = 20$. We used $\Delta = 10$ for Fig. 4 and $\Delta = 50$ for Fig. 5. Because of the correlated support change, the $2048 \times t$ sparse matrix $\mathcal{S}_t = [S_1, S_2, \cdots, S_t]$ is rank deficient in either case, e.g. for Fig. 4, $\mathcal{S}_t$ has rank 29, 39, 49, 259 at $t$ = 300, 400, 500, 2600; for Fig. 5, $\mathcal{S}_t$ has rank 21, 23, 25, 67 at $t$ = 300, 400, 500, 2600. We plot the subspace error $SE_{(t)}$ and the normalized error for $S_t$, $\|\hat{S}_t - S_t\|_2 / \|S_t\|_2$, averaged over 100 Monte Carlo simulations.

As can be seen from Fig. 4 and Fig. 5, the subspace error $SE_{(t)}$ of ReProCS and ReProCS-cPCA decreased exponentially and stabilized. Furthermore, ReProCS-cPCA greatly outperforms ReProCS when deletion steps are done (i.e., at $t$ = 2400 and 4600). The averaged normalized error for $S_t$ followed a similar trend.

We also compared against PCP [2]. At every $t = t_j + 4k\alpha$, we solved the PCP program with $\lambda = 1/\sqrt{\max(n, t)}$, as suggested in [2], to recover $\mathcal{S}_t$ and $\mathcal{L}_t$. We used the estimates of $S_t$ for the last $4\alpha$ frames as the final estimates $\hat{S}_t$. So, the $\hat{S}_t$ for $t = t_j + 1, \dots, t_j + 4\alpha$ is obtained from PCP done at $t = t_j + 4\alpha$, the $\hat{S}_t$ for $t = t_j + 4\alpha + 1, \dots, t_j + 8\alpha$ is obtained from PCP done at $t = t_j + 8\alpha$ and so on. Because of the correlated support change, the error of PCP was larger in both cases.

We also plot the ratio at the projection PCA times. This serves as a proxy for $\kappa_s(D_{j,\mathrm{new},k})$ (which has exponential computational complexity). As can be seen from Fig. 4 and Fig. 5, this ratio is less than 1 and it becomes larger when $\Delta$ increases ($T_t$ becomes more correlated over $t$).

We implemented ReProCS-cPCA using Algorithm 2 with $\alpha = 100$, $\tilde{\alpha} = 200$ and $K = 15$. The algorithm is not very sensitive to these choices. Also, we let $\xi = \xi_t$ and $\omega = \omega_t$ vary with time. Recall that $\xi_t$ is the upper bound on $\|\beta_t\|_2$. We do not know $\beta_t$. All we have is an estimate of $\beta_t$ from $t - 1$, $\hat{\beta}_{t-1} = (I - \hat{P}_{t-1}\hat{P}_{t-1}')\hat{L}_{t-1}$. We used a value a little larger than $\|\hat{\beta}_{t-1}\|_2$; we let $\xi_t = 2\|\hat{\beta}_{t-1}\|_2$. The parameter $\omega_t$ is the support estimation threshold. One reasonable way to pick this is to use a percentage energy threshold of $\hat{S}_{t,\mathrm{cs}}$ [37]. For a vector $v$, define the 99%-energy set of $v$ as $T_{0.99}(v) := \{i : |v_i| \ge v^{0.99}\}$ where the 99% energy threshold, $v^{0.99}$, is the largest value of $|v_i|$ so that $\|v_{T_{0.99}}\|_2^2 \ge 0.99\|v\|_2^2$. It is computed by sorting $|v_i|$ in non-increasing order of magnitude. One keeps adding elements to $T_{0.99}$ until $\|v_{T_{0.99}}\|_2^2 \ge 0.99\|v\|_2^2$. We used $\omega_t = 0.5(\hat{S}_{t,\mathrm{cs}})^{0.99}$.
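
The two data-driven parameter choices described in this paragraph can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the authors' code: the function names are hypothetical, the input vector is assumed non-empty, and `P_hat` is assumed to have orthonormal columns.

```python
import numpy as np

def energy_threshold(v, frac=0.99):
    """Largest magnitude level such that the entries at or above it
    carry at least `frac` of the squared 2-norm of v."""
    mags = np.sort(np.abs(np.asarray(v, dtype=float)))[::-1]   # non-increasing order
    energy = np.cumsum(mags ** 2)
    idx = np.searchsorted(energy, frac * energy[-1])           # first prefix reaching the target
    idx = min(idx, len(mags) - 1)                              # guard against rounding at frac = 1
    return mags[idx]

def energy_set(v, frac=0.99):
    """0-based indices of the entries whose magnitude is at least the energy threshold."""
    return np.flatnonzero(np.abs(v) >= energy_threshold(v, frac))

def xi_from_previous(P_hat, L_hat_prev):
    """xi_t = 2 * ||(I - P_hat P_hat') L_hat_{t-1}||_2."""
    resid = L_hat_prev - P_hat @ (P_hat.T @ L_hat_prev)
    return 2.0 * np.linalg.norm(resid)

v = np.array([3.0, -4.0, 0.1, -0.1])
omega = 0.5 * energy_threshold(v)          # support threshold, as in omega_t = 0.5 * threshold
print(energy_set(v), omega)
```

Sorting once and using a cumulative sum avoids the explicit "keep adding elements" loop while giving the same stopping point.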

IX. Conclusions and Future Work 

We studied the problem of recursive sparse recovery in the presence of large but structured noise (noise lying in a "slowly changing" low dimensional subspace). We introduced the ReProCS with cluster-PCA (ReProCS-cPCA) algorithm that addresses some of the limitations of our earlier work on ReProCS [21] and of PCP [2]. Under mild assumptions, we showed that, w.h.p., ReProCS-cPCA can exactly recover the support set of $S_t$ at all times; and the reconstruction errors of both $S_t$ and $L_t$ are upper bounded by a time-invariant and small value at all times. In ongoing work, we are studying the undersampled measurements case. Open questions include (i) how to analyze a practical version of ReProCS-cPCA (which does not assume knowledge of signal model parameters), and (ii) how to study the correlated $a_t$'s case (e.g. the case where the $a_t$'s satisfy a linear random walk model). The starting point for (ii) would be to try to use the matrix Azuma inequality [26] instead of Hoeffding.

Appendix A 
Proof of Lemma RTTI 

The proof follows by using the following three lemmas. 

Lemma A.l (Exponential decay of C^): Assume that all the conditions of Theorem 14. 1 1 hold. Let £+ = rQ. Define the series 
(k + as in Definition 15.31 Then, 

1) Co = 1 and C < 0.6 fe + 0.4cC for all k = 1, 2, . . . K, 

2) the denominator of Q" is positive for all k = 1, 2, . . .K. 

Proof: This lemma is the same as I12T1 Lemma 37] but with defined differently. ■ 
Lemma A.2 (Sparse recovery, support recovery and expression for $e_t$): Assume that all conditions of Theorem 4.1 hold.

1) If $\zeta_* \le \zeta_*^+ := r\zeta$ and $\zeta_{k-1} \le \zeta_{k-1}^+ \le 0.6^{k-1} + 0.4c\zeta$, then for all $t \in \mathcal{I}_{j,k}$, for any $k = 1, 2, \dots, K$,

a) the projection noise $\beta_t$ satisfies $\|\beta_t\|_2 \le \zeta_{k-1}^+ \sqrt{c}\,\gamma_{new,k} + \zeta_*^+ \sqrt{r}\,\gamma_* \le \sqrt{c}\,0.72^{k-1}\gamma_{new} + 1.06\sqrt{\zeta} \le \xi$,

b) the CS error satisfies $\|\hat{S}_{t,cs} - S_t\|_2 \le 7\xi$,

c) $\hat{T}_t = T_t$,

d) $e_t$ satisfies ® and $\|e_t\|_2 \le \phi^+ [\kappa_s^+ \zeta_{k-1}^+ \sqrt{c}\,\gamma_{new,k} + \zeta_*^+ \sqrt{r}\,\gamma_*] \le 0.18 \cdot 0.72^{k-1} \sqrt{c}\,\gamma_{new} + 1.17 \cdot 1.06\sqrt{\zeta}$.

2) For all $k = 1, 2, \dots, K$, $P(\hat{T}_t = T_t \text{ and } e_t \text{ satisfies ® for all } t \in \mathcal{I}_{j,k} \mid X_{j,k-1,0}) = 1$ for all $X_{j,k-1,0} \in \Gamma_{j,k-1,0}$.

3) For all $k = 1, 2, \dots, K$, $P(\hat{T}_t = T_t \text{ and } e_t \text{ satisfies ® for all } t \in \mathcal{I}_{j,k} \mid \Gamma^e_{j,k-1,0}) = 1$.

Proof: The first claim is the same as [21, Lemma 30] but with $\zeta_*^+$ defined differently. The proof follows in an analogous
fashion. The second claim follows from the first using Remark 6.3. The third claim follows using Lemma 1.4. ■
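
To make the shape of the bound in item 1d) concrete, the following sketch evaluates only the numerical constants that appear in the lemma statement. No values of $c$, $\gamma_{new}$ or $\zeta$ are assumed.

```latex
\begin{align*}
\|e_t\|_2 &\le 0.18 \cdot 0.72^{k-1} \sqrt{c}\,\gamma_{new} + 1.2402\sqrt{\zeta}
  && (1.17 \times 1.06 = 1.2402), \\
k = 1: &\quad 0.18 \cdot 0.72^{0} = 0.18, \\
k = 5: &\quad 0.18 \cdot 0.72^{4} \approx 0.18 \times 0.2687 \approx 0.0484, \\
k = 10: &\quad 0.18 \cdot 0.72^{9} \approx 0.18 \times 0.0520 \approx 0.0094 .
\end{align*}
```

The first term decays geometrically in $k$, while the second term, $1.2402\sqrt{\zeta}$, does not depend on $k$ and acts as a floor on the bound.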
Lemma A.3 (High probability bound on $\zeta_k$): Assume that all the conditions of Theorem 4.1 hold. Let $\zeta_*^+ = r\zeta$. Then, for
all $k = 1, 2, \dots, K$,

$P(\zeta_k \le \zeta_k^+ \mid \Gamma^e_{j,k-1,0}) \ge p_k(\alpha, \zeta)$

where $\zeta_k^+$ is defined in Definition 5.3 and $p_k(\alpha, \zeta)$ is defined in [21, Lemma 35].

Proof: Using Lemma A.1, (i) $\zeta_0^+ = 1$ and $\zeta_{k-1}^+ \le 0.6^{k-1} + 0.4c\zeta$ and (ii) the denominator of $\zeta_k^+$ is positive. Using this
and the theorem's conditions, the above lemma follows exactly as in [21, Lemma 35]. The only difference is that $\zeta_*^+$ is defined
differently. Also, $\Gamma_{j,k} := \Gamma_{j,k,0}$. The proof proceeds by first bounding $\zeta_k$ (in a fashion similar to the bound in Lemma 7.6);
using Lemma A.2 to get an expression for $e_t$; and finally using Corollaries 1.6 and 1.7 to get high probability bounds on each
of the terms in the bound on $\zeta_k$. ■

Proof of Lemma 6.1: Lemma 6.1 follows by combining Lemma A.3 and the third claim of Lemma A.2 and using the
fact that $P(\Gamma^e_{j,k,0} \mid \Gamma^e_{j,k-1,0}) = P(\zeta_k \le \zeta_k^+,\ \hat{T}_t = T_t \text{ and } e_t \text{ satisfies ® for all } t \in \mathcal{I}_{j,k} \mid \Gamma^e_{j,k-1,0})$. ■
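
The combination step can be spelled out with an elementary probability fact. The event names $A$, $B$ and $C$ are shorthand introduced here and are not notation from the paper. The last line assumes that the target of Lemma 6.1 is a lower bound of $p_k(\alpha,\zeta)$, which is how the statement of Lemma A.3 reads.

```latex
% Shorthand (hypothetical names, for illustration only):
%   A = \{\zeta_k \le \zeta_k^+\}
%   B = \{\hat{T}_t = T_t and e_t satisfies its stated expression for all t \in \mathcal{I}_{j,k}\}
%   C = \Gamma^e_{j,k-1,0}
\begin{align*}
P(A \cap B \mid C) &\ge P(A \mid C) + P(B \mid C) - 1 \\
                   &= P(A \mid C) && \text{since } P(B \mid C) = 1 \text{ (Lemma A.2, third claim)} \\
                   &\ge p_k(\alpha, \zeta) && \text{(Lemma A.3)}.
\end{align*}
```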
Appendix B
Proof of Lemma 7.3

Proof of Lemma 7.3:

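1) The first claim follows because $\|D_{det,k}\|_2 = \|\Psi_{k-1} G_{det,k}\|_2 = \|\Psi_{k-1} [G_1 G_2 \cdots G_{k-1}]\|_2 \le \sum_{k_1=1}^{k-1} \|\Psi_{k-1} G_{k_1}\|_2 \le
\sum_{k_1=1}^{k-1} \|(I - \hat{G}_{k_1}\hat{G}_{k_1}') G_{k_1}\|_2 = \sum_{k_1=1}^{k-1} \tilde{\zeta}_{k_1} \le \sum_{k_1=1}^{k-1} \tilde{c}_{k_1} \zeta \le r\zeta$. The first inequality follows by the triangle inequality. The second
one follows because $\hat{G}_1, \cdots, \hat{G}_{k-1}$ are mutually orthonormal and so $\Psi_{k-1} = \prod_{k_2=1}^{k-1} (I - \hat{G}_{k_2}\hat{G}_{k_2}')$.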

2) By the first claim, $\|(I - \hat{G}_{det,k}\hat{G}_{det,k}') G_{det,k}\|_2 = \|\Psi_{k-1} G_{det,k}\|_2 \le r\zeta$. By item 2) of Lemma 1.2 with $P = G_{det,k}$ and
$\hat{P} = \hat{G}_{det,k}$, the result $\|G_{det,k} G_{det,k}' - \hat{G}_{det,k}\hat{G}_{det,k}'\|_2 \le 2r\zeta$ follows.

3) Recall that $D_k = E_k R_k$ is a QR decomposition where $E_k$ is orthonormal and $R_k$ is upper triangular. Therefore,
$\sigma_i(D_k) = \sigma_i(R_k)$. Since $\|(I - \hat{G}_{det,k}\hat{G}_{det,k}') G_{det,k}\|_2 = \|\Psi_{k-1} G_{det,k}\|_2 \le r\zeta$ and $G_k' G_{det,k} = 0$, by item 4) of Lemma
1.2 with $P = G_{det,k}$, $\hat{P} = \hat{G}_{det,k}$ and $Q = G_k$, we have $\sqrt{1 - r^2\zeta^2} \le \sigma_i((I - \hat{G}_{det,k}\hat{G}_{det,k}') G_k) = \sigma_i(D_k) \le 1$.

4) Since $D_k = E_k R_k$, we have $\|D_{undet,k}' E_k\|_2 = \|D_{undet,k}' D_k R_k^{-1}\|_2 = \|G_{undet,k}' \Psi_{k-1} \Psi_{k-1} G_k R_k^{-1}\|_2 =
\|G_{undet,k}' \Psi_{k-1} G_k R_k^{-1}\|_2 = \|G_{undet,k}' D_k R_k^{-1}\|_2 = \|G_{undet,k}' E_k\|_2$. Since $E_k = D_k R_k^{-1} = (I - \hat{G}_{det,k}\hat{G}_{det,k}') G_k R_k^{-1}$,

$\|G_{undet,k}' E_k\|_2 = \|G_{undet,k}' (I - \hat{G}_{det,k}\hat{G}_{det,k}') G_k R_k^{-1}\|_2$

$\le \|G_{undet,k}' (I - \hat{G}_{det,k}\hat{G}_{det,k}') G_k\|_2 \, (1/\sqrt{1 - r^2\zeta^2}) = \|G_{undet,k}' \hat{G}_{det,k}\hat{G}_{det,k}' G_k\|_2 \, (1/\sqrt{1 - r^2\zeta^2})$.

By item 3) of Lemma 1.2 with $P = G_{det,k}$, $\hat{P} = \hat{G}_{det,k}$ and $Q = G_{undet,k}$, we get $\|G_{undet,k}' \hat{G}_{det,k}\|_2 \le r\zeta$. By item 3)
of Lemma 1.2 with $P = G_{det,k}$, $\hat{P} = \hat{G}_{det,k}$ and $Q = G_k$, we get $\|\hat{G}_{det,k}' G_k\|_2 \le r\zeta$. Therefore,

$\|G_{undet,k}' E_k\|_2 = \|E_k' G_{undet,k}\|_2 \le \dfrac{r^2\zeta^2}{\sqrt{1 - r^2\zeta^2}}$.
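
The final inequality chain in claim 4) combines three facts, written out below as a sketch. The numerical value $r\zeta = 10^{-2}$ is a hypothetical choice used only to show the size of the resulting bound.

```latex
\begin{align*}
\|R_k^{-1}\|_2 &= \frac{1}{\sigma_{\min}(R_k)} = \frac{1}{\sigma_{\min}(D_k)}
  \le \frac{1}{\sqrt{1 - r^2\zeta^2}} && \text{(claim 3)}, \\
\|G_{undet,k}' \hat{G}_{det,k} \hat{G}_{det,k}' G_k\|_2
  &\le \|G_{undet,k}' \hat{G}_{det,k}\|_2 \, \|\hat{G}_{det,k}' G_k\|_2
  \le (r\zeta)(r\zeta) && \text{(Lemma 1.2, item 3, applied twice)}, \\
\|E_k' G_{undet,k}\|_2 &\le \frac{r^2\zeta^2}{\sqrt{1 - r^2\zeta^2}}
  \approx 1.00005 \times 10^{-4} && \text{for } r\zeta = 10^{-2}.
\end{align*}
```

For small $r\zeta$ the denominator is close to 1, so the bound is essentially $r^2\zeta^2$, one order smaller in $r\zeta$ than the bounds in claims 1) and 2).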